Pub. online: 27 Apr 2021 | Type: Philosophies of Data Science
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 219–242
Abstract
The coronavirus disease 2019 (COVID-19) pandemic caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has placed epidemic modeling at the center of public policymaking. Predicting the severity and speed of COVID-19 transmission is crucial for resource management and for developing strategies to deal with this epidemic. Based on the available data from current and previous outbreaks, many efforts have been made to develop epidemiological models, including statistical models, computer simulations, and mathematical representations of the virus and its impacts, among others. Despite their usefulness, modeling and forecasting the spread of COVID-19 remain a challenge. In this article, we give an overview of the unique features and issues of COVID-19 data and how they impact epidemic modeling and projection. In addition, we illustrate how various models can be connected to each other. Moreover, we provide new data science perspectives on the challenges of COVID-19 forecasting, from data collection, curation, and validation to the limitations of models and the uncertainty of forecasts. Finally, we discuss some data science practices that are crucial to more robust and accurate epidemic forecasting.
Researchers and public officials tend to agree that until a vaccine is readily available, stopping SARS-CoV-2 transmission is the name of the game. Testing is the key to preventing the spread, especially by asymptomatic individuals. With testing capacity restricted, group testing is an appealing alternative for comprehensive screening and has recently received FDA emergency authorization. This technique tests pools of individual samples, often requiring fewer testing resources while potentially providing a multiple-fold speedup. We approach group testing from a data science perspective and offer two contributions. First, we provide an extensive empirical comparison of modern group testing techniques based on simulated data. Second, we propose a simple one-round method based on $\ell_1$-norm sparse recovery, which outperforms current state-of-the-art approaches at certain disease prevalence rates.
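The one-round $\ell_1$ recovery idea can be sketched in a few lines. Below is a minimal illustration, not the authors' exact formulation: pool measurements are assumed noiseless and quantitative, the pooling design is a random Bernoulli matrix, and recovery is basis pursuit solved as a linear program with SciPy.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 100, 30, 3                # individuals, pools, infected individuals

x_true = np.zeros(n)                # sparse infection vector
x_true[rng.choice(n, size=k, replace=False)] = 1.0

A = rng.binomial(1, 0.1, size=(m, n)).astype(float)   # random pooling design
y = A @ x_true                      # noiseless quantitative pool measurements

# Basis pursuit: minimize ||x||_1 subject to Ax = y and x >= 0.
# With nonnegativity, ||x||_1 = sum(x), so this is a linear program.
res = linprog(c=np.ones(n), A_eq=A, b_eq=y,
              bounds=[(0, None)] * n, method="highs")
print("true infected:", np.flatnonzero(x_true))
print("recovered:    ", np.flatnonzero(res.x > 0.5))
```

With only 30 pooled measurements for 100 individuals, the sparse infection vector is typically recovered exactly at this prevalence, which is the resource saving that motivates group testing.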
It is hypothesized that short-term exposure to air pollution may influence the transmission of aerosolized pathogens such as SARS-CoV-2. We used data from 23 provinces in Italy to build a generalized additive model investigating the association between the effective reproductive number of the disease and air quality, while controlling for ambient environmental variables and changes in human mobility. The model finds a positive, nonlinear relationship between the density of airborne particulate matter and COVID-19 transmission, consistent with similar studies of other respiratory illnesses.
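As a rough sketch of the modeling idea, a GAM with one smooth term per covariate can be fit with the pygam package; the input file and column names below (pm25, temp, humidity, mobility, Rt) are hypothetical stand-ins for the paper's variables.

```python
import pandas as pd
from pygam import LinearGAM, s

# One row per province-day; column names are hypothetical placeholders for
# particulate density, ambient conditions, mobility change, and effective R.
df = pd.read_csv("italy_provinces.csv")

X = df[["pm25", "temp", "humidity", "mobility"]].to_numpy()
y = df["Rt"].to_numpy()

# One smooth (spline) term per covariate, as in a generalized additive model.
gam = LinearGAM(s(0) + s(1) + s(2) + s(3)).fit(X, y)
gam.summary()

# The partial dependence of Rt on particulate matter shows the shape
# (possibly nonlinear) of the fitted association.
XX = gam.generate_X_grid(term=0)
effect = gam.partial_dependence(term=0, X=XX)
```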
The coronavirus disease of 2019 (COVID-19) is a pandemic. To characterize its transmissibility, we propose a Bayesian change point detection model using daily counts of actively infectious cases. Our model builds on a Bayesian Poisson segmented regression model that 1) captures the epidemiological dynamics under changing conditions caused by external or internal factors; 2) provides uncertainty estimates of both the number and locations of change points; and 3) can adjust for time-varying covariate effects. Our model can be used to evaluate public health interventions, identify latent events associated with spreading rates, and yield better short-term forecasts.
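For intuition, here is a minimal single-change-point Poisson model in PyMC on simulated counts; the paper's model is richer, handling multiple change points, segmented (piecewise) trends, and covariates.

```python
import numpy as np
import pymc as pm

# Toy daily counts with one shift in the underlying rate at day 40.
rng = np.random.default_rng(1)
counts = np.concatenate([rng.poisson(5, 40), rng.poisson(20, 40)])
t = np.arange(len(counts))

with pm.Model():
    tau = pm.DiscreteUniform("tau", lower=0, upper=len(counts) - 1)
    lam1 = pm.Exponential("lam1", 0.1)      # rate before the change point
    lam2 = pm.Exponential("lam2", 0.1)      # rate after the change point
    lam = pm.math.switch(t < tau, lam1, lam2)
    pm.Poisson("obs", mu=lam, observed=counts)
    trace = pm.sample(2000, tune=1000)      # posterior of tau quantifies
                                            # change point uncertainty
```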
Pub. online: 23 Feb 2021 | Type: Statistical Data Science
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 348–364
Abstract
It is widely acknowledged that the reported numbers of COVID-19 infections were incomplete. We propose a structured approach that distinguishes between cases that are reflected later in the confirmed counts and cases with mild or no symptoms that are never captured by any reporting system. The number of infected cases in the US is estimated to be 220.54% of the reported count as of April 20, 2020. This implies an overall infection ratio of 0.53% and a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO in March 2020.
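The headline numbers can be reproduced with back-of-the-envelope arithmetic; the reported case count and US population below are assumed round figures for illustration only.

```python
reported_cases = 790_000        # assumed US reported cases, Apr 20, 2020
us_population = 330_000_000     # assumed US population

multiplier = 2.2054             # estimated infections = 220.54% of reported
estimated_infections = reported_cases * multiplier

infection_ratio = estimated_infections / us_population
print(f"overall infection ratio: {infection_ratio:.2%}")    # ~0.53%

case_mortality = 0.0285         # deaths as a share of estimated infections
print(f"implied deaths: {estimated_infections * case_mortality:,.0f}")
```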
Pub. online: 22 Feb 2021 | Type: Data Science in Action
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 334–347
Abstract
Coronavirus and the COVID-19 pandemic have substantially altered the ways in which people learn, interact, and discover information. In the absence of everyday in-person interaction, how do people self-educate while living in isolation during such times? More specifically, do communities emerge in Google search trends related to coronavirus? Using a suite of network and community detection algorithms, we scrape and mine all Google search trends in America related to an initial search for “coronavirus,” from the first Google search on the term (January 16, 2020) through August 11, 2020. Results indicate a near-constant shift in how people educate themselves about coronavirus. Queries in the earliest days focus on “Wuhan” and “China,” then shift to “stimulus checks” at the height of the virus in the U.S., and finally to queries about local surges of new cases. A few communities emerge around terms more overtly related to coronavirus (e.g., “cases,” “symptoms”). Yet, given the shift in related queries and the broader information environment, no clear community structure emerges for the full search space.
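The community detection step can be illustrated with networkx on a toy query graph; the edge list below is hypothetical, whereas the paper builds its graph from scraped related-query data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are search queries; an edge links a query to a related query
# suggested by Google Trends (toy edge list).
edges = [
    ("coronavirus", "wuhan"), ("coronavirus", "china"),
    ("coronavirus", "symptoms"), ("symptoms", "fever"),
    ("coronavirus", "stimulus checks"), ("stimulus checks", "irs"),
]
G = nx.Graph(edges)

# Modularity-based communities group queries that co-occur as related terms.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```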
Pub. online: 22 Feb 2021 | Type: COVID-19 Special Issue
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 314–333
Abstract
As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein’s 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.
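A much simpler cousin of the temporal part of such an analysis is a binomial GLM on the weekly share of sequences carrying a mutation; the counts below are simulated, and the paper itself fits a Bayesian hierarchical model over clustered sequences rather than this reduction.

```python
import numpy as np
import statsmodels.api as sm

# Simulated weekly counts: sequences sampled and sequences carrying a variant.
weeks = np.arange(20)
n_seqs = np.full(20, 200)
n_variant = np.round(n_seqs / (1 + np.exp(-(weeks - 10) / 2))).astype(int)

# Binomial GLM with logit link: the slope on `weeks` is the logistic
# growth rate of the variant's share over time.
X = sm.add_constant(weeks)
endog = np.column_stack([n_variant, n_seqs - n_variant])
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(fit.params)   # a positive slope indicates the variant is spreading
```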
Pub. online: 22 Feb 2021 | Type: Computing in Data Science
Journal:Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 293–313
Abstract
The COVID-19 (COrona VIrus Disease 2019) pandemic has had profound global consequences for health, the economy, social behavior, and almost every other major aspect of human life. It is therefore of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of an artificial-intelligence-enhanced COVID-19 analysis (AICov for short), which provides an integrative deep learning framework for COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple strategies into AICov, including deep learning based on Long Short-Term Memory (LSTM) networks and event modeling. To demonstrate our approach, we introduce a framework that integrates population covariates from multiple sources. AICov thus includes not only data on COVID-19 cases and deaths but, more importantly, the population’s socioeconomic, health, and behavioral risk factors at their specific locations. Feeding these compiled data into AICov yields improved predictions compared with a model that uses only case and death data. Because the models use deep learning, they adapt over time as they learn from past data.
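A stripped-down version of the forecasting idea, an LSTM over sliding windows of cases and deaths augmented with location covariates, can be written in Keras. The shapes and random placeholder data below are assumptions for illustration, not the AICov architecture itself.

```python
import numpy as np
import tensorflow as tf

window, n_features = 14, 2 + 8   # 14-day window; cases, deaths, 8 covariates
n_samples = 5000                 # windows pooled across locations

# Random placeholder data with the right shapes: static covariates are
# repeated along the time axis next to the daily case/death series.
X = np.random.rand(n_samples, window, n_features).astype("float32")
y = np.random.rand(n_samples, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),    # next-day forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64)
```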
Large volumes of trajectory data collected from human and vehicle mobility are highly sensitive due to privacy concerns. Generating synthetic yet plausible trajectory data is therefore pivotal in many location-based studies and applications. However, existing LSTM-based methods are unsuitable for modeling large-scale sequences because of the vanishing-gradient problem, and existing GAN-based methods are coarse-grained. Considering the geographical and sequential features of trajectories, we propose a map-based Two-Stage GAN method (TSG) to tackle these challenges and generate fine-grained, plausible large-scale trajectories. In the first stage, we convert GPS point data into a discrete grid representation that serves as input to a modified deep convolutional generative adversarial network, which learns the general pattern. In the second stage, inside each grid cell, we design an effective encoder-decoder network as the generator to extract road information from the map image and embed it into two parallel Long Short-Term Memory networks that generate GPS point sequences. A discriminator conditioned on the encoded map image constrains the generated point sequences so that they do not deviate from the corresponding road networks. Experiments on real-world data demonstrate the effectiveness of our model in preserving geographical features and hidden mobility patterns. Moreover, the generated trajectories not only match the distribution of the real data but also achieve satisfactory road-network matching accuracy.
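The stage-one preprocessing, converting GPS points into a discrete grid, might look like the following sketch; the grid size, bounding box, and occupancy encoding are assumptions.

```python
import numpy as np

def points_to_grid(points, bbox, n_cells=64):
    """Map (lat, lon) points into an n_cells x n_cells occupancy grid."""
    lat_min, lat_max, lon_min, lon_max = bbox
    grid = np.zeros((n_cells, n_cells), dtype=np.float32)
    for lat, lon in points:
        i = int((lat - lat_min) / (lat_max - lat_min) * (n_cells - 1))
        j = int((lon - lon_min) / (lon_max - lon_min) * (n_cells - 1))
        grid[i, j] = 1.0           # occupied cell, input to the stage-one GAN
    return grid

# Toy trajectory inside an assumed bounding box.
trajectory = [(39.91, 116.40), (39.92, 116.41), (39.93, 116.43)]
grid = points_to_grid(trajectory, bbox=(39.80, 40.00, 116.30, 116.50))
```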
A key step in building regression models is selecting appropriate covariates to explain the behavior of the response variable, a task in which stepwise procedures occupy a prominent position. In this paper we perform several simulation studies to investigate whether a specific stepwise approach, namely Strategy A, properly selects authentic variables within the generalized additive models for location, scale and shape (GAMLSS) framework, considering Gaussian, zero-inflated Poisson, and Weibull distributions. Continuous explanatory variables (with linear and nonlinear relationships) and categorical explanatory variables are considered, and they are selected through goodness-of-fit statistics. Overall, we conclude that Strategy A performs well.
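For readers outside the GAMLSS ecosystem, a generic forward-stepwise selection by AIC conveys the flavor of such procedures. This is a Python sketch of the general technique; "Strategy A" is the specific GAMLSS stepwise procedure studied in the paper, and plain OLS stands in for the GAMLSS fit.

```python
import statsmodels.formula.api as smf

def forward_select(df, response, candidates):
    """Greedily add the term that most lowers AIC; stop when none helps."""
    selected, best_aic = [], float("inf")
    while candidates:
        scores = [
            (smf.ols(f"{response} ~ " + " + ".join(selected + [term]),
                     data=df).fit().aic, term)
            for term in candidates
        ]
        aic, term = min(scores)
        if aic >= best_aic:        # no candidate improves the fit criterion
            break
        best_aic = aic
        selected.append(term)
        candidates.remove(term)
    return selected
```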