JP Journal of Biostatistics

The JP Journal of Biostatistics is a highly regarded open-access international journal indexed in the Emerging Sources Citation Index (ESCI). It focuses on the application of statistical theory and methods in resolving problems in biological, biomedical, and agricultural sciences. The journal encourages the submission of experimental papers that employ relevant algorithms and also welcomes survey articles in the fields of biostatistics and epidemiology.

A COMPARATIVE ANALYSIS OF PROBABILISTIC AND MACHINE LEARNING MODELS FOR BIOLOGICAL SEQUENCE CLASSIFICATION

Authors

  • J. Jeslin
  • A. Radhika
  • M. Haripriya

Keywords:

machine learning, Markov model, maximum likelihood estimation, natural language processing, sequence analysis

DOI:

https://doi.org/10.17654/0973514325014

Abstract

The human genome, encompassing about six billion nucleotides, is a comprehensive blueprint for biological processes. Understanding genetic diversity enables the identification of disease susceptibility genes and potential drug targets, and supports the development of diagnostic tools. Because genome projects generate vast numbers of DNA sequences, automated classification is essential. This study employed probabilistic modeling (specifically, a Markov model and maximum likelihood estimation) along with machine learning (ML) classifiers integrated with natural language processing (NLP) to analyze intricate genomic datasets, focusing on gene family classification. We evaluated our approach on a dataset of raw human DNA sequences; the multinomial Naïve Bayes and multilayer perceptron classifiers achieved specificity and sensitivity exceeding 90%. The Markov model yielded results on par with the ML classifiers, indicating its significance in genetics and healthcare advancements. By combining stochastic models, NLP techniques, and ML algorithms, the study's approach establishes a robust framework for future genomics research and promises further insights into genetic mechanisms and therapeutic interventions.
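As a rough illustration of the probabilistic side of the approach summarized in the abstract, the sketch below fits one first-order Markov chain per gene family by estimating transition probabilities from counts, then assigns a new sequence to the family whose chain gives it the highest log-likelihood. It is not the authors' implementation: the chain order, the additive pseudocount (which departs from a pure maximum likelihood estimate so that unseen transitions do not get zero probability), the function names, and the toy sequences and family labels are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"  # assumption: sequences use only the four DNA bases


def fit_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition log-probabilities from counts.

    With pseudocount=0 this is the plain maximum likelihood estimate;
    the default of 1.0 is an illustrative smoothing choice.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            if prev in counts and cur in counts[prev]:
                counts[prev][cur] += 1
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(c / total) for b, c in row.items()}
    return log_probs


def log_likelihood(seq, log_probs):
    """Sum transition log-probabilities, skipping symbols outside ALPHABET."""
    return sum(
        log_probs[prev][cur]
        for prev, cur in zip(seq, seq[1:])
        if prev in log_probs and cur in log_probs[prev]
    )


def classify(seq, models):
    """Return the label whose Markov chain scores the sequence highest."""
    return max(models, key=lambda label: log_likelihood(seq, models[label]))


# Hypothetical toy data; real gene-family sequences are far longer.
training = {
    "family_A": ["ATATATATAT", "TATATATATA", "ATATATTATA"],
    "family_B": ["GCGCGCGCGC", "CGCGCGCGCG", "GCGCGGCGCG"],
}
models = {label: fit_markov(seqs) for label, seqs in training.items()}

print(classify("ATATATAT", models))
print(classify("GCGCGC", models))
```

Working in log space avoids numerical underflow on long sequences, and a higher-order chain could be obtained by conditioning on the previous k bases instead of one, at the cost of many more parameters to estimate.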

Received: October 23, 2024
Accepted: February 14, 2025

Published

2025-04-15

Issue

Vol. 25 No. 2 (2025)

Section

Articles

How to Cite

A COMPARATIVE ANALYSIS OF PROBABILISTIC AND MACHINE LEARNING MODELS FOR BIOLOGICAL SEQUENCE CLASSIFICATION. (2025). JP Journal of Biostatistics, 25(2), 273-294. https://doi.org/10.17654/0973514325014
