Journal of Data Science

Statistical Challenges in the Analysis of Sequence and Structure Data for the COVID-19 Spike Protein
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 314–333
Shiyu He, Samuel W.K. Wong

https://doi.org/10.6339/21-JDS1006
Pub. online: 22 February 2021      Type: COVID-19 Special Issue     

Received: 31 December 2020
Accepted: 18 January 2021
Published: 22 February 2021

Abstract

As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein’s 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.

Supplementary material

The processed data, R code, and instructions for reproducing the results in this paper are provided in a supplementary .zip file.



Copyright
© 2021 The Author(s)
This is a free to read article.

Keywords
Bayesian hierarchical models; compositional data analysis; conformational sampling; mutant clusters; SARS-CoV-2

