Clustering US States by Time Series of COVID-19 New Case Counts in the Early Months with Non-Negative Matrix Factorization
Volume 20, Issue 1 (2022), pp. 79–94
Pub. online: 4 February 2022
Type: Data Science In Action
Open Access
Received
23 January 2022
Accepted
25 January 2022
Published
4 February 2022
Abstract
The spread of COVID-19 in the early months of the pandemic differed substantially across US states, which adopted different quarantine measures and reopening policies. We proposed clustering the US states into distinct communities based on their daily new confirmed case counts from March 22 to July 25, 2020, using nonnegative matrix factorization (NMF) followed by k-means clustering on the coefficients of the NMF basis. A cross-validation method was used to select the rank of the NMF. The method clustered the 49 continental states (including the District of Columbia) into 7 groups, two of which contained a single state. To investigate how the clustering evolved over time, the same method was applied successively to time periods in one-week increments, starting from the period of March 22 to March 28. The results suggested a change point in the clustering in the week starting May 30, caused by the combined impact of quarantine measures and reopening policies.
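The two-stage procedure described in the abstract (NMF, then k-means on the NMF coefficients) can be illustrated with a minimal Python sketch. This is not the authors' code, which is written in R and listed in the supplementary material. It makes several assumptions: the data matrix V has one row per day and one column per state, NMF is fitted with standard multiplicative updates under squared error, basis columns are rescaled to unit sum, and the data are synthetic curves with either an early or a late peak. The function names nmf and kmeans are hypothetical helpers.

```python
import numpy as np

def nmf(V, rank, n_iter=1000, seed=0, eps=1e-9):
    """Approximate V (days x states) by W @ H with nonnegative factors,
    using multiplicative updates for the squared (Frobenius) error."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps   # basis time series
    H = rng.random((rank, V.shape[1])) + eps   # per-state coefficients
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Assumption: rescale each basis column to unit sum so that the
    # coefficients in H are comparable across basis components.
    scale = W.sum(axis=0) + eps
    return W / scale, H * scale[:, None]

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on the rows of X; returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic stand-in for the case-count matrix: 60 days x 6 "states".
days = np.arange(60)
early = np.exp(-0.5 * ((days - 15) / 5.0) ** 2)   # early-peak pattern
late = np.exp(-0.5 * ((days - 45) / 5.0) ** 2)    # late-peak pattern
V = np.column_stack([early * s for s in (1.0, 1.2, 0.9)]
                    + [late * s for s in (1.1, 0.8, 1.0)])
V = V + 0.01 * np.random.default_rng(1).random(V.shape)  # small nonnegative noise

W, H = nmf(V, rank=2)
labels = kmeans(H.T, k=2)          # one coefficient vector per state
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(labels, round(float(rel_error), 4))
```

Each column of H holds the weights one state places on the basis curves in W, so states with similar epidemic trajectories end up close together in coefficient space, which is what k-means then groups.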
Supplementary material
1. data_10_05.csv: This file contains the data from a public repository maintained by the Center for Systems Science and Engineering at the Johns Hopkins University (Dong et al., 2020). The data were retrieved on October 5, 2020. The case numbers may differ from those in the current version owing to possible modifications made after October 5, 2020.
2. nst-est2019-01.csv: This file contains the state-level population data maintained by the US Census Bureau (https://www.census.gov). The data were released at the end of 2019.
3. pretreat.R: Code for pre-processing the data (e.g., smoothing and scaling).
4. getnmfparameter.R: Code for obtaining NMF ranks via the cross-validation method proposed in the paper.
5. model_fit.R: Code for implementing the NMF method. Running this file also produces the results of k-means clustering (including the selection of k).
6. plotmaking.R: Code for generating the figures in the paper.
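For readers unfamiliar with cross-validated rank selection for NMF, the following Python sketch shows one generic hold-out scheme. It is not the procedure implemented in getnmfparameter.R, whose details are given in the paper rather than here. The assumptions are: a random 20% of matrix entries are hidden, NMF is fitted to the remaining entries with mask-weighted multiplicative updates, and candidate ranks are compared by mean squared error on the hidden entries. The names masked_nmf and heldout_error and the synthetic data are hypothetical.

```python
import numpy as np

def masked_nmf(V, mask, rank, n_iter=500, seed=0, eps=1e-9):
    """Fit V ~ W @ H using only entries where mask == 1
    (weighted multiplicative updates for squared error)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    MV = mask * V
    for _ in range(n_iter):
        H *= (W.T @ MV) / (W.T @ (mask * (W @ H)) + eps)
        W *= (MV @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H

def heldout_error(V, rank, holdout_frac=0.2, n_repeats=3, seed=0):
    """Average mean squared error on randomly hidden entries."""
    rng = np.random.default_rng(seed)
    errors = []
    for rep in range(n_repeats):
        # mask == 1 marks entries used for fitting, 0 marks hidden entries
        mask = (rng.random(V.shape) >= holdout_frac).astype(float)
        W, H = masked_nmf(V, mask, rank, seed=rep)
        resid = (V - W @ H)[mask == 0]
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# Synthetic stand-in data with two underlying temporal patterns.
days = np.arange(60)
early = np.exp(-0.5 * ((days - 15) / 5.0) ** 2)
late = np.exp(-0.5 * ((days - 45) / 5.0) ** 2)
V = np.column_stack([early * s for s in (1.0, 1.2, 0.9)]
                    + [late * s for s in (1.1, 0.8, 1.0)])
V = V + 0.01 * np.random.default_rng(1).random(V.shape)

errors = {r: heldout_error(V, r) for r in (1, 2, 3, 4)}
best_rank = min(errors, key=errors.get)   # rank with smallest hold-out error
print(errors, best_rank)
```

The idea is that a rank that is too small cannot reproduce the hidden entries, while a rank that is too large tends to fit noise in the observed entries, so the hold-out error gives a data-driven way to compare candidate ranks.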