CLUSTERING OF COUNT DATA USING POISSON DISTRIBUTION
Keywords:
model-based clustering, information criteria, expectation-maximization algorithm, Poisson distribution
DOI:
https://doi.org/10.17654/0972361725053
Abstract
Cluster analysis is often used to identify homogeneous groups within complex datasets, particularly when traditional distance-based methods struggle with high-dimensional or skewed data. In this study, we propose a model-based clustering approach for count data using a finite mixture of Poisson distributions. The model accounts for overdispersion and skewness, with parameters estimated via the expectation-maximization (EM) algorithm. Information criteria such as AIC and BIC are employed for model selection. A key novelty of this work lies in applying Poisson mixture models to a large-scale health survey dataset, specifically the Behavioral Risk Factor Surveillance System (BRFSS), treating BMI as discrete count data. The proposed method is also benchmarked against lognormal mixture models and performs better in terms of misclassification rate and adjusted Rand index (ARI). Additionally, the impact of initialization strategies on EM convergence is examined using both real and simulated datasets. Results confirm that Poisson mixture-based clustering offers a more effective and interpretable solution for count data than traditional approaches.
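The approach summarized in the abstract (a finite Poisson mixture fitted by EM, with AIC and BIC used to choose the number of components) can be illustrated with a minimal univariate sketch. This is not the authors' implementation: the function name `fit_poisson_mixture`, the quantile-based initialization, the stopping tolerance, and the parameter count `2k - 1` (k rates plus k - 1 free mixing weights) are all assumptions made for illustration.

```python
import numpy as np
from math import lgamma


def _log_joint(x, log_fact, pi, lam):
    # log(pi_j) + log Poisson(x_i | lam_j), returned with shape (n, k)
    return np.log(pi) + x[:, None] * np.log(lam) - lam - log_fact[:, None]


def _logsumexp_rows(a):
    # Numerically stable log(sum(exp(a))) along each row, shape (n, 1)
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))


def fit_poisson_mixture(x, k, n_iter=500, tol=1e-8):
    """EM for a k-component univariate Poisson mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    log_fact = np.array([lgamma(v + 1.0) for v in x])  # log(x_i!)

    # Assumed initialization: component rates at evenly spaced quantiles
    lam = np.quantile(x, (np.arange(k) + 0.5) / k) + 1e-3
    pi = np.full(k, 1.0 / k)

    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior membership probabilities (responsibilities)
        log_w = _log_joint(x, log_fact, pi, lam)
        log_norm = _logsumexp_rows(log_w)
        resp = np.exp(log_w - log_norm)
        loglik = float(log_norm.sum())
        if loglik - prev < tol:
            break
        prev = loglik
        # M-step: weighted proportions and weighted means
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        pi = nk / n
        lam = np.maximum(resp.T @ x / nk, 1e-12)

    # Recompute log-likelihood and responsibilities at the final parameters
    log_w = _log_joint(x, log_fact, pi, lam)
    log_norm = _logsumexp_rows(log_w)
    loglik = float(log_norm.sum())
    resp = np.exp(log_w - log_norm)

    n_params = 2 * k - 1  # k rates + (k - 1) free mixing weights
    return {
        "pi": pi,
        "lam": lam,
        "loglik": loglik,
        "aic": -2.0 * loglik + 2.0 * n_params,
        "bic": -2.0 * loglik + n_params * np.log(n),
        "labels": resp.argmax(axis=1),
    }


if __name__ == "__main__":
    # Simulated counts from two Poisson components (rates chosen arbitrarily)
    rng = np.random.default_rng(0)
    counts = np.concatenate([rng.poisson(3, 1000), rng.poisson(30, 1000)])
    for k in (1, 2, 3):
        fit = fit_poisson_mixture(counts, k)
        print(k, round(fit["aic"], 1), round(fit["bic"], 1))
```

Under these assumptions, the number of components would be chosen as the `k` with the smallest AIC or BIC. Because EM converges only to a local maximum, a practical run would typically repeat the fit from several starting values rather than rely on a single initialization.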
Received: September 28, 2024
Accepted: May 22, 2025
Copyright (c) 2025 Pushpa Publishing House, Prayagraj, India

This work is licensed under a Creative Commons Attribution 4.0 International License.
____________________________
Attribution: Credit Pushpa Publishing House as the original publisher, including title and author(s) if applicable.
No Derivatives: Modifying or creating derivative works is not allowed without written permission.
Contact Pushpa Publishing House for more info or permissions.