Journal of Data Science


Statistical Challenges in the Analysis of Sequence and Structure Data for the COVID-19 Spike Protein
Volume 19, Issue 2 (2021): Special issue: Continued Data Science Contributions to COVID-19 Pandemic, pp. 314–333
Shiyu He, Samuel W.K. Wong

https://doi.org/10.6339/21-JDS1006
Pub. online: 22 February 2021      Type: COVID-19 Special Issue      Open Access

Received: 31 December 2020
Accepted: 18 January 2021
Published: 22 February 2021

Abstract

As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein’s 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.

Supplementary material

The processed data, R code, and instructions for reproducing the results in this paper are provided in a supplementary .zip file.



Copyright
2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
Bayesian hierarchical models; compositional data analysis; conformational sampling; mutant clusters; SARS-CoV-2

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
