JP Journal of Biostatistics

The JP Journal of Biostatistics is a highly regarded open-access international journal indexed in the Emerging Sources Citation Index (ESCI). It focuses on the application of statistical theory and methods in resolving problems in biological, biomedical, and agricultural sciences. The journal encourages the submission of experimental papers that employ relevant algorithms and also welcomes survey articles in the fields of biostatistics and epidemiology.

LEVERAGING ON HYBRID MACHINE LEARNING MODELS FOR EARLY BREAST CANCER DETECTION

Authors

  • Gideon Nyakundi
  • John Ndiritu
  • Joseph Mwaniki
  • Timothy Kamanu

Keywords:

breast cancer, LightGBM, principal component analysis, Borderline-SMOTE

DOI:

https://doi.org/10.17654/0973514326002

Abstract

Breast cancer is among the most common cancers in women worldwide, and outcomes improve with early detection. As machine learning enters routine care, data-driven diagnostic systems may support earlier risk estimation. We present a compact pipeline that uses Principal Component Analysis (PCA) for dimensionality reduction and Borderline-SMOTE for class-imbalance correction, followed by classification with a Light Gradient Boosting Machine (LightGBM). Using the standardized Wisconsin Breast Cancer Diagnostic dataset, we retain 20 principal components to capture key variance while limiting redundancy and noise. Borderline-SMOTE is applied within each training fold to refine class boundaries. Performance is evaluated with stratified 10-fold cross-validation and compared with seven alternatives: XGBoost, Support Vector Machine, Random Forest, Logistic Regression, Gaussian Naive Bayes, k-Nearest Neighbors, and a Multilayer Perceptron. With 20 components, the proposed model attains an accuracy of 0.993, a precision of 1.000, a recall of 0.986, an F1 score of 0.993, and an AUC of 1.000 for distinguishing benign from malignant cases, outperforming the baselines. These findings suggest that coupling dimensionality reduction, boundary-focused resampling, and gradient-boosted trees can enhance diagnostic performance and may inform clinical decision support.
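
The idea of applying Borderline-SMOTE within each training fold can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, made-up two-dimensional toy data rather than the Wisconsin dataset, and hypothetical helper names (`k_nearest`, `borderline_smote`, `stratified_folds`). The PCA and LightGBM steps are omitted. The sketch assumes the Borderline-SMOTE1 rule, in which a minority sample counts as borderline when at least half, but not all, of its nearest neighbours belong to the majority class.

```python
import math
import random


def k_nearest(i, X, candidates, k):
    """Indices of the k candidates closest to X[i] (Euclidean), excluding i."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]


def borderline_smote(X, y, minority=1, m=5, k=5, seed=0):
    """Synthesize minority points near the class boundary, up to class balance."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)  # majority minus minority
    if len(min_idx) < 2 or n_new <= 0:
        return []
    danger = []
    for i in min_idx:
        neigh = k_nearest(i, X, range(len(X)), m)
        n_major = sum(y[j] != minority for j in neigh)
        # Borderline rule: at least half, but not all, neighbours are majority.
        if len(neigh) / 2 <= n_major < len(neigh):
            danger.append(i)
    synthetic = []
    while danger and len(synthetic) < n_new:
        i = rng.choice(danger)                       # a borderline sample
        j = rng.choice(k_nearest(i, X, min_idx, k))  # one of its minority neighbours
        gap = rng.random()                           # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


def stratified_folds(y, n_folds=10, seed=0):
    """Split indices into n_folds test folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # deal each class round-robin
    return folds


# Toy 2-D data, made up for illustration (NOT the Wisconsin dataset).
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(120)]   # label 0
X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(40)]   # label 1
y = [0] * 120 + [1] * 40

fold_report = []
for test_idx in stratified_folds(y, n_folds=10):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    X_tr = [X[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    # Resample the training fold only; the held-out fold is never touched.
    synth = borderline_smote(X_tr, y_tr)
    X_bal = X_tr + synth
    y_bal = y_tr + [1] * len(synth)
    # A classifier would be fit on (X_bal, y_bal) and scored on the test fold here.
    fold_report.append((len(test_idx), y_bal.count(0), y_bal.count(1)))

for n_test, n_major, n_minor in fold_report:
    print(f"test={n_test}  train majority={n_major}  train minority={n_minor}")
```

Resampling inside the loop keeps synthetic points out of the held-out fold, so the evaluation data stay untouched. In practice, one plausible way to assemble the full pipeline is with scikit-learn's `PCA` and `StratifiedKFold`, imbalanced-learn's `BorderlineSMOTE` and `Pipeline`, and LightGBM's `LGBMClassifier`. The abstract does not state which libraries were used.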

Received: October 25, 2025
Accepted: December 8, 2025

Author Biography

Timothy Kamanu

Lecturer, Department of Mathematics

Published

2025-12-25

Section

Articles

How to Cite

LEVERAGING ON HYBRID MACHINE LEARNING MODELS FOR EARLY BREAST CANCER DETECTION. (2025). JP Journal of Biostatistics, 26(1), 11-40. https://doi.org/10.17654/0973514326002
