Clustering US States by Time Series of COVID-19 New Case Counts in the Early Months with Non-Negative Matrix Factorization

Chen, Jianmin; Zhang, Panpan

doi:10.6339/22-JDS1036

Journal of Data Science

Volume 20, Issue 1 (2022), pp. 79–94

Pub. online: 4 February 2022
Type: Data Science In Action
Open Access

Received: 23 January 2022
Accepted: 25 January 2022
Published: 4 February 2022

Abstract

The spreading pattern of COVID-19 in the early months of the pandemic differed substantially across US states, which adopted different quarantine measures and reopening policies. We proposed clustering the US states into distinct communities based on the daily new confirmed case counts from March 22 to July 25, 2020, via a nonnegative matrix factorization (NMF) followed by k-means clustering on the coefficients of the NMF basis. A cross-validation method was employed to select the rank of the NMF. The method clustered the 49 continental states (including the District of Columbia) into 7 groups, two of which contained a single state. To investigate how the clustering results changed over time, the same method was applied successively to time periods in one-week increments, starting from the period of March 22 to March 28. The results suggested a change point in the clustering in the week starting on May 30, caused by the combined impact of quarantine measures and reopening policies.
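The two-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their code is the R material listed under Supplementary material); it is a standard-library Python illustration under several assumptions: the data matrix has days as rows and states as columns, the NMF is fitted with the multiplicative updates of Lee and Seung (2000), the rank is fixed at 2 rather than chosen by cross-validation, k-means is a plain Lloyd iteration with a deterministic start, and the six "states" are synthetic curves made up for the example. All function names are hypothetical.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(X, rank, n_iter=500, seed=0, eps=1e-9):
    """Approximate X (days x states) by W @ H with W, H >= 0 (Frobenius loss)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(len(X))]
    H = [[rng.random() + 0.1 for _ in range(len(X[0]))] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W'X) / (W'WH), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (XH') / (WHH'), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def rel_error(X, W, H):
    """Relative Frobenius reconstruction error of W @ H."""
    approx = matmul(W, H)
    num = sum((x - a) ** 2 for xr, ar in zip(X, approx) for x, a in zip(xr, ar))
    den = sum(x ** 2 for xr in X for x in xr)
    return (num / den) ** 0.5

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Synthetic example: three "states" with an early peak, three with a late rise.
days = range(20)
early = [max(0.0, 10.0 - 2.0 * abs(t - 5)) for t in days]
late = [max(0.0, t - 8.0) for t in days]
scales = [1.0, 1.1, 0.9]
columns = [[s * v for v in early] for s in scales] + [[s * v for v in late] for s in scales]
X = transpose(columns)            # rows = days, columns = states

W, H = nmf(X, rank=2)             # W: temporal basis, H: per-state coefficients
coeffs = transpose(H)             # one coefficient vector per state
labels = kmeans(coeffs, k=2)      # cluster states by their NMF coefficients
print(labels, round(rel_error(X, W, H), 4))
```

In this sketch each column of W plays the role of a temporal pattern and each column of H holds one state's loadings on those patterns, so states with similar case curves end up with similar coefficient vectors and fall into the same k-means cluster.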

Supplementary material

1. data_10_05.csv: This file contains the data from a public repository maintained by the Center for Systems Science and Engineering at Johns Hopkins University (Dong et al., 2020). The data were retrieved on October 5, 2020. The case numbers may differ from those in the current version owing to possible modifications made after October 5, 2020.
2. nst-est2019-01.csv: This file contains the state-level population data, maintained by the US Census Bureau (https://www.census.gov). The data were released at the end of 2019.
3. pretreat.R: Code for pre-processing the data (e.g., smoothing and scaling).
4. getnmfparameter.R: Code for obtaining NMF ranks via the cross-validation method proposed in the paper.
5. model_fit.R: Code for implementing the NMF method. Running this file also produces the results of k-means clustering (including the selection of k).
6. plotmaking.R: Code for generating the figures in the paper.

References

Arumugadevi S, Seenivasagam V (2015). Comparison of clustering methods for segmenting color images. Indian Journal of Science and Technology, 8(7): 670.

Boutsidis C, Gallopoulos E (2008). SVD based initialization: a head start for nonnegative matrix factorization. Pattern Recognition, 41(4): 1350–1362.

Brunet JP, Tamayo P, Golub TR, Mesirov JP (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences of the United States of America, 101(12): 4164–4169.

Chen WC, Maitra R (2015). EMCluster: EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution. R Package. URL http://cran.r-project.org/package=EMCluster.

Chiou JM, Li PL (2007). Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 69(4): 679–699.

Devarajan K (2008). Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLOS Computational Biology, 4(7): e1000029.

Ding C, He X, Simon HD (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the 2005 SIAM International Conference on Data Mining (H Kargupta, J Srivastava, C Kamath, A Goodman, eds.), 606–610. SIAM, Philadelphia, PA, USA.

Dong E, Du H, Gardner L (2020). An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5): 533–534.

D’Urso P, De Giovanni L, Vitale V (2021a). Spatial robust fuzzy clustering of COVID-19 time series based on B-splines. Spatial Statistics, 100518. https://doi.org/10.1016/j.spasta.2021.100518.

D’Urso P, Mucciardi M, Otranto E, Vitale V (2021b). Community mobility in the European regions during COVID-19 pandemic: a partitioning around medoids with noise cluster based on space-time autoregressive models. Spatial Statistics, 100531. https://doi.org/10.1016/j.spasta.2021.100531.

Fauci AS, Lane HC, Redfield RR (2020). COVID-19—navigating the uncharted. The New England Journal of Medicine, 382(13): 1268–1269.

Gaujoux R, Seoighe C (2010). A flexible R package for nonnegative matrix factorization. BMC Bioinformatics, 11: 367.

Gelbard R, Goldman O, Spiegler I (2007). Investigating diversity of clustering methods: an empirical comparison. Data & Knowledge Engineering, 63(1): 155–166.

Goyal P, Choi JJ, Pinheiro LC, Schenck EJ, Chen R, Jabri A, et al. (2020). Clinical characteristics of COVID-19 in New York City. The New England Journal of Medicine, 382(24): 2372–2374.

Guillamet D, Bressan M, Vitria J (2001). A weighted non-negative matrix factorization for local representations. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume I, 942–947. IEEE, Piscataway, NJ, USA.

Hartigan JA, Wong MA (1979). Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1): 100–108.

Hirano S, Sun X, Tsumoto S (2004). Comparison of clustering methods for clinical databases. Information Sciences, 159(3–4): 155–165.

Hubert L, Arabie P (1985). Comparing partitions. Journal of Classification, 2: 193–218.

Jacques J, Preda C (2014). Functional data clustering: a survey. Advances in Data Analysis and Classification, 8: 231–255.

Kanagal B, Sindhwani V (2010). Rank selection in low-rank matrix approximations: a study of cross-validation for NMFs. In: Low-Rank Methods for Large-scale Machine Learning (Workshop in NIPS’10) (A Gretton, M Mahoney, M Mohri, A Talwalkar, eds.). https://www.cs.umd.edu/~bhargav/nips2010.pdf.

Kim YD, Choi S (2009). Weighted nonnegative matrix factorization. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 1541–1544. IEEE, Piscataway, NJ, USA.

Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, et al. (2020). The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Annals of Internal Medicine, 172(9): 577–582.

Lee DD, Seung HS (2000). Algorithms for non-negative matrix factorization. In: Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS’00) (TK Leen, TG Dietterich, V Tresp, eds.), 535–541. MIT Press, Cambridge, MA, USA.

Li X, Zhang P, Feng Q (2022). Exploring COVID-19 in mainland China during the lockdown of Wuhan via functional data analysis. Communications for Statistical Applications and Methods. In press.

Liao TW (2005). Clustering of time series data—a survey. Pattern Recognition, 38(11): 1857–1874.

Lin J, Vlachos M, Keogh E, Gunopulos D (2004). Iterative incremental clustering of time series. In: Proceedings of the 9th International Conference on Extending Database Technology (E Bertino, S Christodoulakis, D Plexousakis, V Christophides, M Koubarakis, K Böhm, E Ferrari, eds.), 106–122. Springer-Verlag, Berlin, Heidelberg, Germany.

Madhulatha TS (2011). Comparison between k-means and k-medoids clustering algorithms. In: Advances in Computing and Information Technology (DC Wyld, M Wozniak, N Chaki, N Meghanathan, D Nagamalai, eds.), 472–481. Springer, Berlin, Heidelberg, Germany.

Moghadas SM, Shoukat A, Fitzpatrick MC, Wells CR, Sah P, Pandey A, et al. (2020). Projecting hospital utilization during the COVID-19 outbreaks in the United States. Proceedings of the National Academy of Sciences of the United States of America, 117(16): 9122–9126.

Shaw C, King G (1992). Using cluster analysis to classify time series. Physica D: Nonlinear Phenomena, 58(1–4): 288–298.

Tang C, Wang T, Zhang P (2022). Functional data analysis: an application to COVID-19 data in the United States. Quantitative Biology. In press. arXiv preprint: https://arxiv.org/abs/2009.08363.

Tian T, Tan J, Jiang Y, Wang X, Zhang H (2020). Evaluate the risk of resumption of business for the states of New York, New Jersey and Connecticut via a pre-symptomatic and asymptomatic transmission model of COVID-19. medRxiv preprint: https://doi.org/10.1101/2020.05.16.20103747.

Vitale V, D’Urso P, De Giovanni L (2021). Spatio-temporal object-oriented Bayesian network modelling of the COVID-19 Italian outbreak data. Spatial Statistics, 100529. https://doi.org/10.1016/j.spasta.2021.100529.

Wang G, Kossenkov AV, Ochs MF (2006). LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics, 7: 175.

Zhang P, Wang T, Xie SX (2020). Meta-analysis of several epidemic characteristics of COVID-19. Journal of Data Science, 18(3): 536–549.

© 2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

change point, COVID-19, k-means clustering, non-negative matrix factorization
