A COMPARATIVE STUDY OF TOPIC MODELLING TECHNIQUES AND SVM CLASSIFICATION FOR THE EXTRACTION OF EMERGING THEMES ON IMMUNITY FROM CORD-19
Keywords:
topic modelling, CORD-19, latent Dirichlet allocation, support vector machine, text mining
DOI:
https://doi.org/10.17654/0972361725069
Abstract
The objective of this study is to explore thematic structures in, and to classify, abstracts on innate and adaptive immunity extracted from the CORD-19 dataset. The study evaluates the effectiveness of several topic modelling and classification techniques for uncovering key topics and patterns in the dataset. For topic modelling, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-negative Matrix Factorization (NMF) were employed. In addition, a Support Vector Machine (SVM) classifier with LSA-reduced features was applied to evaluate classification performance across different numbers of topics (k-values). To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was used. The SVM model, trained with an RBF kernel, achieved high classification performance, as evidenced by the confusion matrix, ROC curve, and classification report. Model performance was assessed using precision, recall, and F1-score. The findings include the top terms identified by the topic models and the terms extracted from the SVM model. The results showed that LDA (fitted with Gibbs sampling and with variational EM) and SVM with LSA reduction outperformed the other methods in terms of classification accuracy and topic coherence. The study highlights the potential of combining topic modelling and machine learning techniques for analyzing scientific literature. The findings contribute to understanding emerging themes in innate and adaptive immunity research. This work offers insights for researchers and healthcare professionals by enabling efficient exploration of large-scale biomedical datasets and supporting further research on immune responses.
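The oversampling step named in the abstract can be illustrated with a minimal sketch. The abstract does not state which software or settings were used, so everything below is an assumption made for illustration: the function name `smote`, the neighbour count `k`, the fixed seed, and the toy two-dimensional feature vectors (which stand in for LSA-reduced abstracts) are hypothetical and are not taken from the study. The sketch follows the basic SMOTE idea: choose a minority sample, choose one of its k nearest minority neighbours, and place a synthetic point at a random position on the line segment between them.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours.

    Illustrative sketch only; not the implementation used in the study.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x (Euclidean distance, x excluded)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical 2-D feature vectors (e.g. two LSA components per abstract).
majority = [[0.1 * i, 0.05 * i] for i in range(10)]
minority = [[2.0, 2.0], [2.2, 1.9], [1.8, 2.1]]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints: 10 10
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used to compute precision, recall, and F1-score.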
Received: June 10, 2025
Accepted: October 3, 2025
License
Copyright (c) 2025 Pushpa Publishing House, Prayagraj, India

This work is licensed under a Creative Commons Attribution 4.0 International License.
____________________________
Attribution: Credit Pushpa Publishing House as the original publisher, including title and author(s) if applicable.
No Derivatives: Modifying or creating derivative works not allowed without written permission.
Contact Pushpa Publishing House for more info or permissions.