
Journal of Data Science


Mutstats: An Ultra-fast Computational Method to Determine Clonal Status of Somatic Mutations
Volume 19, Issue 3 (2021), pp. 465–484
Dehua Bi, Subhajit Sengupta, Tianjian Zhou, et al. (4 authors)

https://doi.org/10.6339/21-JDS1016
Pub. online: 1 June 2021
Type: Data Science In Action

Received: 23 January 2021
Accepted: 16 May 2021
Published: 1 June 2021

Abstract

A tumor cell population is a mixture of heterogeneous cell subpopulations, known as subclones. Identifying the clonal status of a mutation, i.e., whether it occurs in all tumor cells (clonal) or only in a subset of them (subclonal), is crucial for understanding tumor progression and developing personalized treatment strategies. We make three major contributions in this paper: (1) we summarize the terminology used in the literature under a unified mathematical representation of subclones; (2) we develop a simulation algorithm that generates hypothetical sequencing data resembling real data; and (3) we present an ultra-fast computational method, Mutstats, to infer the clonal status of somatic mutations from tumor sequencing data. The inference is based on a Gaussian mixture model for mutation multiplicities. To validate Mutstats, we evaluate its performance on simulated datasets and on two breast carcinoma samples from The Cancer Genome Atlas project.
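To make the mixture-model idea concrete, the following is a minimal sketch of fitting a two-component, one-dimensional Gaussian mixture by EM and reading off a clonal probability for each mutation. It is not the Mutstats implementation: the simulated values, the cluster centers (0.4 and 1.0), the fixed choice of two components, the reading of the higher-mean component as "clonal", and all function names are assumptions made purely for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, sds) with components sorted by increasing mean.
    """
    lo, hi = min(values), max(values)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    spread = (hi - lo) / 4.0 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
    order = sorted(range(2), key=lambda k: means[k])
    return ([weights[k] for k in order],
            [means[k] for k in order],
            [sds[k] for k in order])


def clonal_probability(x, weights, means, sds):
    """Posterior probability that x belongs to the higher-mean component.

    Treating that component as 'clonal' is an assumption of this sketch.
    """
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return dens[1] / (dens[0] + dens[1])


# Hypothetical data: stand-ins for estimated mutation multiplicities.
random.seed(0)
subclonal = [random.gauss(0.4, 0.05) for _ in range(150)]
clonal = [random.gauss(1.0, 0.05) for _ in range(150)]
multiplicities = subclonal + clonal

weights, means, sds = fit_two_component_gmm(multiplicities)
p_clonal_high = clonal_probability(1.0, weights, means, sds)
p_clonal_low = clonal_probability(0.4, weights, means, sds)
```

Fixing the number of components at two keeps the sketch short; in practice the number of clusters would usually be chosen from the data, for example with an information criterion.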

Supplementary material

We include an Appendix on the Bayesian model used by the PyClone method. The simulation data can be obtained from https://compgenome.shinyapps.io/tumorsim. The code for Mutstats and the real data used in this analysis are available on the author's GitHub page, https://github.com/edwardbi/Mutstats.



Copyright
© 2021 The Author(s)
This is a free to read article.

Keywords
cancer genomics; next-generation sequencing; subclone; tumor heterogeneity

Funding
Yuan Ji’s research is partly supported by NIH R01 CA132897.

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
