IMPACT OF DATA PREPROCESSING AND BALANCING ON DIABETES PREDICTION USING THE DECISION TREE TECHNIQUE
Keywords:
decision tree technique, diabetes detection, data preprocessing, SMOTE balancing, K-fold cross-validation
DOI:
https://doi.org/10.17654/0975045223008
Abstract
The International Diabetes Federation (IDF) reported that 537 million people worldwide had diabetes in 2021. In addition, about one third of people with diabetes are not aware that they have the disease. The number of people with diabetes is estimated to reach 783 million by 2045. However, the most common type of diabetes, type 2 diabetes (T2D), is largely preventable if effective early prediction is possible. The technologies of the fourth industrial revolution make huge amounts of data, including electronic medical records, available for exploration, but extracting knowledge and understanding from these data remains a major challenge. Recent advances in machine learning can be applied to uncover hidden patterns that help diagnose diabetes at an early stage. This work presents a methodology for modelling diabetes prediction from data, using various machine learning algorithms based on the decision tree technique. The study is organised around three axes, namely preprocessing, SMOTE (synthetic minority over-sampling technique) balancing and K-fold cross-validation, examined through three case scenarios.
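As a rough illustration of two of the axes named in the abstract, the sketch below shows the core idea of SMOTE (a synthetic minority sample is placed on the segment between a minority point and one of its nearest minority neighbours) and of K-fold splitting. It is a minimal standard-library sketch, not the authors' implementation: the function names `smote` and `k_fold_indices`, the parameters `k` and `seed`, and the toy data are assumptions made for illustration only.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance).
    Assumes `minority` holds at least two points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

# Hypothetical toy data: four minority-class points in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
splits = list(k_fold_indices(10, k=5))
```

When the two steps are combined, oversampling is commonly applied to the training folds only, so that synthetic points derived from a test sample do not leak into training.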
Received: December 30, 2022
Revised: March 22, 2023
Accepted: April 19, 2023
License
Copyright (c) 2023 PUSHPA PUBLISHING HOUSE, PRAYAGRAJ, INDIA

This work is licensed under a Creative Commons Attribution 4.0 International License.
Attribution: Credit Pushpa Publishing House as the original publisher, including title and author(s) if applicable.
Non-Commercial Use: For non-commercial purposes only. No commercial activities without explicit permission.
No Derivatives: Modifying or creating derivative works not allowed without written permission.
Contact Pushpa Publishing House for more info or permissions.


