Pub. online: 4 Aug 2022. Type: Research Article. Open Access.
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 849–859
Abstract
Millions of people traveled from Wuhan to other cities between January 1 and January 23, 2020. Using masked software development kit (SDK) data from Aurora Mobile Ltd and open epidemic data released by health authorities, we analyze the relationship between the number of confirmed COVID-19 cases in a region and the number of people who traveled from Wuhan to that region during this period. We further identify high-risk carriers of COVID-19 to improve control of the disease. The key findings are threefold: (1) in each region, the number of high-risk carriers is highly positively correlated with the severity of illness; (2) a history of visits to the 62 designated hospitals is the foremost index of risk; (3) the second most important index is the traveler's duration of stay in Wuhan. Based on our analysis, we estimate that, as of February 4, 2020, (a) among the 8.5 million people held up in Wuhan, there are 425 thousand high-risk carriers; and (b) among the 3.5 million migrant workers held up in Hubei, there are 175 thousand high-risk carriers. The disease control authorities should closely monitor these groups.
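As a quick arithmetic check on the two estimates quoted above, the short Python sketch below computes the share of high-risk carriers that each pair of figures implies. The abstract does not say how the estimates were derived, so this only restates the quoted numbers; the dictionary layout and the function name `implied_share` are illustrative assumptions, not part of the paper.

```python
# Figures quoted in the abstract (estimates as of February 4, 2020).
# Each entry maps a group to (population, estimated high-risk carriers).
groups = {
    "held up in Wuhan": (8_500_000, 425_000),
    "migrant workers held up in Hubei": (3_500_000, 175_000),
}


def implied_share(population, high_risk):
    """Return the fraction of a population counted as high-risk carriers.

    Hypothetical helper for illustration; it is plain division and does
    not reproduce the paper's estimation procedure.
    """
    return high_risk / population


shares = {name: implied_share(pop, hr) for name, (pop, hr) in groups.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Both groups work out to the same implied share of 5%, which suggests (but does not confirm) that a single rate was applied to both populations.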
Pub. online: 4 Aug 2022. Type: Research Article. Open Access.
Journal: Journal of Data Science
Volume 18, Issue 3 (2020): Special issue: Data Science in Action in Response to the Outbreak of COVID-19, pp. 511–525
Abstract
Proteins play a key role in facilitating the infectiousness of the 2019 novel coronavirus. A specific spike protein enables this virus to bind to human cells, and a thorough understanding of its 3-dimensional structure is therefore critical for developing effective therapeutic interventions. However, its structure may continue to evolve over time as a result of mutations. In this paper, we take a data science perspective to study the potential structural impacts of ongoing mutations in its amino acid sequence. To do so, we identify a key segment of the protein and apply a sequential Monte Carlo sampling method to detect possible changes to the space of low-energy conformations for different amino acid sequences. Such computational approaches can further our understanding of this protein structure and complement laboratory efforts.
Researchers and public officials tend to agree that until a vaccine is readily available, stopping SARS-CoV-2 transmission is the name of the game. Testing is the key to preventing the spread, especially by asymptomatic individuals. With testing capacity restricted, group testing is an appealing alternative for comprehensive screening and has recently received FDA emergency authorization. This technique tests pools of individual samples, thereby often requiring fewer testing resources while potentially providing a severalfold speedup. We approach group testing from a data science perspective and offer two contributions. First, we provide an extensive empirical comparison of modern group testing techniques based on simulated data. Second, we propose a simple one-round method based on $\ell_1$-norm sparse recovery, which outperforms current state-of-the-art approaches at certain disease prevalence rates.
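To illustrate why pooling can save tests, the Python sketch below uses the classical two-stage (Dorfman) pooling scheme. This is a textbook baseline, not the one-round $\ell_1$-norm method proposed in the paper. It assumes independent infections at a common prevalence and a perfectly accurate test; the function names and the `max_pool_size` cap are illustrative choices.

```python
def expected_tests_per_person(prevalence, pool_size):
    """Expected number of tests per individual under two-stage pooling.

    Stage 1 tests each pool of `pool_size` samples once. Stage 2 retests
    every member of a pool that came back positive. Assumes independent
    infections and an error-free test (simplifying assumptions).
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if pool_size == 1:
        return 1.0  # no pooling: one test per person
    # Probability that a pool contains at least one infected sample.
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    # One shared pool test, plus an individual retest if the pool is positive.
    return 1.0 / pool_size + p_pool_positive


def best_pool_size(prevalence, max_pool_size=50):
    """Return the pool size in 1..max_pool_size with the fewest expected tests."""
    return min(
        range(1, max_pool_size + 1),
        key=lambda k: expected_tests_per_person(prevalence, k),
    )


if __name__ == "__main__":
    for prevalence in (0.01, 0.05, 0.20):
        k = best_pool_size(prevalence)
        cost = expected_tests_per_person(prevalence, k)
        print(f"prevalence={prevalence:.2f}: pool size {k}, "
              f"{cost:.3f} expected tests per person")
```

Under these assumptions the savings shrink as prevalence rises, which is consistent with the abstract's remark that the relative performance of group testing methods depends on the disease prevalence rate.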
Pub. online: 22 Feb 2021. Type: COVID-19 Special Issue.
Journal: Journal of Data Science
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 314–333
Abstract
As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein’s 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.