LEVERAGING ON HYBRID MACHINE LEARNING MODELS FOR EARLY BREAST CANCER DETECTION
Keywords:
breast cancer, LightGBM, principal component analysis, Borderline-SMOTE
DOI:
https://doi.org/10.17654/0973514326002
Abstract
Breast cancer is among the most common cancers in women worldwide, and outcomes improve with early detection. As machine learning enters routine care, data-driven diagnostic systems may support earlier risk estimation. We present a compact pipeline that uses Principal Component Analysis (PCA) for dimensionality reduction and Borderline-SMOTE for imbalance correction, followed by classification with Light Gradient Boosting Machine (LightGBM). Using the standardized Wisconsin Breast Cancer Diagnostic dataset, we retain 20 principal components to capture key variance while limiting redundancy and noise. Borderline-SMOTE is applied within each training fold to refine class boundaries. Performance is evaluated with stratified 10-fold cross-validation and compared with seven alternatives: XGBoost, Support Vector Machines, Random Forests, Logistic Regression, Gaussian Naive Bayes, k-Nearest Neighbors, and a Multilayer Perceptron. With 20 components, the proposed model attains an accuracy of 0.993, precision of 1.000, recall of 0.986, F1 score of 0.993, and AUC of 1.000 for distinguishing benign from malignant cases, outperforming the baselines. These findings suggest that coupling dimensionality reduction, boundary-focused resampling, and gradient-boosted trees can enhance diagnostic performance and may inform clinical decision support.
Received: October 25, 2025
Accepted: December 8, 2025
License
Copyright (c) 2025 PUSHPA PUBLISHING HOUSE, PRAYAGRAJ, INDIA

This work is licensed under a Creative Commons Attribution 4.0 International License.
_________________________
Attribution: Credit Pushpa Publishing House as the original publisher, including title and author(s) if applicable.
Non-Commercial Use: For non-commercial purposes only. No commercial activities without explicit permission.
No Derivatives: Modifying or creating derivative works not allowed without written permission.
Contact Pushpa Publishing House for more information or permissions.
Google h-index: 10