Advances and Applications in Statistics

Advances and Applications in Statistics is an internationally recognized journal indexed in the Emerging Sources Citation Index (ESCI). It publishes original research papers and survey articles in all areas of statistics, both computational and experimental.

A COMPARATIVE STUDY OF TOPIC MODELLING TECHNIQUES AND SVM CLASSIFICATION FOR THE EXTRACTION OF EMERGING THEMES ON IMMUNITY FROM CORD-19

Authors

  • S. K. M. Jeyasree
  • G. Vijayasree

Keywords:

topic modelling, CORD-19, latent Dirichlet allocation, support vector machine, text mining

DOI:

https://doi.org/10.17654/0972361725069

Abstract

The objective of this study is to explore thematic structures in, and classify, abstracts related to innate and adaptive immunity extracted from the CORD-19 dataset. The study evaluates the effectiveness of several topic modelling and classification techniques for uncovering key topics and patterns in the dataset. For topic modelling, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-negative Matrix Factorization (NMF) were employed. Additionally, a Support Vector Machine (SVM) classifier with LSA-reduced features was applied to evaluate classification performance across various numbers of topics (k-values). To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was used. The SVM model, trained with an RBF kernel, achieved high classification performance, as evidenced by the confusion matrix, ROC curve, and classification report. Model performance was assessed using precision, recall, and F1-score. Research findings included the top terms identified by the topic models and the terms extracted from the SVM model. The results demonstrated that LDA with Gibbs sampling, LDA with variational EM, and SVM with LSA reduction outperformed the other methods in terms of classification accuracy and topic coherence. The study highlights the potential of combining topic modelling and machine learning techniques for analyzing scientific literature. The findings contribute to understanding emerging themes in innate and adaptive immunity research. This work offers valuable insights for researchers and healthcare professionals by enabling efficient exploration of large-scale biomedical datasets and supporting further research on immune responses.
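Two of the steps named in the abstract, SMOTE-style oversampling and the precision, recall, and F1 metrics, can be illustrated with a small sketch. This is not the authors' code: the function names `smote_sketch` and `precision_recall_f1`, the toy data, and the choice of k = 2 neighbours are all assumptions made for illustration, and the sketch uses only the Python standard library rather than whatever software the study actually used.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, from true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy minority class: three points in a 2-D feature space (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority, n_new=4)

# Toy labels (hypothetical): 1 = minority class, 0 = majority class.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Each synthetic point lies on the line segment between two existing minority samples, which is the core idea of SMOTE; in a pipeline like the one the abstract describes, oversampling of this kind would normally be applied to the training split only, so that the evaluation metrics are computed on untouched data.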

Received: June 10, 2025
Accepted: October 3, 2025

Published

October 25, 2025

Section

Articles

How to Cite

A COMPARATIVE STUDY OF TOPIC MODELLING TECHNIQUES AND SVM CLASSIFICATION FOR THE EXTRACTION OF EMERGING THEMES ON IMMUNITY FROM CORD-19. (2025). Advances and Applications in Statistics, 92(11), 1605-1633. https://doi.org/10.17654/0972361725069
