Advances and Applications in Statistics

The Advances and Applications in Statistics is an internationally recognized journal indexed in the Emerging Sources Citation Index (ESCI). It provides a platform for original research papers and survey articles in all areas of statistics, both computational and experimental in nature.

INVESTIGATING TERM WEIGHTING SCHEMES ON THE CLASSIFICATION PERFORMANCE FOR THE IMBALANCED TEXT DATA

Authors

  • Afra Al Manei
  • Iman Al Hasani
  • Ronald Wesonga

Keywords:

term weighting, multinomial Naive Bayes, support vector machine, text analysis, research thesis.

DOI:

https://doi.org/10.17654/0972361722050

Abstract

Term weighting (TW) has been found to improve results in text classification. However, little evidence exists on how different TW schemes differ in their effect on classification performance. In this study, we present the results of an investigation of the three most popular TW schemes, namely count, term frequency-inverse document frequency (TFIDF) and term frequency-inverse category frequency (TFICF), under the multinomial Naive Bayes (MNB) and support vector machine (SVM) classification algorithms using imbalanced text data. Our results revealed that the count weighting scheme with MNB gives a higher macro-average recall than the other schemes with SVM. On the other hand, TFICF with SVM generates a higher macro-average recall than the other two schemes. The findings suggest that TW schemes have different effects on the classification of imbalanced text data. Whereas the count weighting scheme performs better in classifying text data using MNB, the same count scheme with SVM seems to handle the imbalanced data issue better than count under the MNB classifier. Our findings therefore indicate that classification performance on imbalanced text data can improve greatly when the count weighting scheme is used with MNB and the TFICF scheme is used with the SVM classifier. This study is significant because it recommends a benchmark for applying TW schemes with classification algorithms on imbalanced text data.
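As an illustration only, the three weighting schemes and the macro-average recall metric named in the abstract can be sketched in plain Python. The formulas below are common textbook forms (tf x log(N/df) for TFIDF, tf x log(C/cf) for TFICF) and are an assumption; this page does not state the exact definitions the authors used. The toy corpus, the function names `weights` and `macro_recall`, and the class labels are hypothetical.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical, not the paper's data); class "b" is the minority.
docs = [
    ("a", "model data model"),
    ("a", "data sample"),
    ("a", "model sample data"),
    ("b", "thesis data"),
]
tokens = [(label, text.split()) for label, text in docs]
n_docs = len(tokens)
classes = sorted({label for label, _ in tokens})

# Document frequency: number of documents that contain the term.
df = Counter(t for _, words in tokens for t in set(words))

# Category frequency: number of classes in which the term occurs.
cf = Counter()
for c in classes:
    cf.update({t for label, words in tokens if label == c for t in words})


def weights(words, scheme):
    """Weight the terms of one document under an assumed textbook formula."""
    tf = Counter(words)
    if scheme == "count":
        return dict(tf)
    if scheme == "tfidf":
        return {t: n * math.log(n_docs / df[t]) for t, n in tf.items()}
    if scheme == "tficf":
        return {t: n * math.log(len(classes) / cf[t]) for t, n in tf.items()}
    raise ValueError(f"unknown scheme: {scheme}")


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total = sum(1 for t in y_true if t == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


first_doc = tokens[0][1]
for scheme in ("count", "tfidf", "tficf"):
    print(scheme, weights(first_doc, scheme))

# A classifier that always predicts the majority class gets 3 of 4 documents
# right, yet its macro-average recall is only the mean of 1.0 and 0.0.
print(macro_recall(["a", "a", "a", "b"], ["a", "a", "a", "a"]))
```

In this sketch a term that appears in every document ("data") receives zero weight under both TFIDF and TFICF, and the final line shows why macro-average recall is a stricter measure than accuracy when one class is rare.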

Received: April 7, 2022
Accepted: May 26, 2022

Published

24-09-2025

How to Cite

INVESTIGATING TERM WEIGHTING SCHEMES ON THE CLASSIFICATION PERFORMANCE FOR THE IMBALANCED TEXT DATA. (2025). Advances and Applications in Statistics, 78, 63-82. https://doi.org/10.17654/0972361722050
