Mutstats: An Ultra-fast Computational Method to Determine Clonal Status of Somatic Mutations

Tumor cell population is a mixture of heterogeneous cell subpopulations, known as subclones. Identiﬁcation of clonal status of mutations, i.e., whether a mutation occurs in all tumor cells or in a subset of tumor cells, is crucial for understanding tumor progression and developing personalized treatment strategies. We make three major contributions in this paper: (1) we summarize terminologies in the literature based on a uniﬁed mathematical representation of subclones; (2) we develop a simulation algorithm to generate hypothetical sequencing data that are akin to real data; and (3) we present an ultra-fast computational method, Mutstats, to infer clonal status of somatic mutations from sequencing data of tumors. The inference is based on a Gaussian mixture model for mutation multiplicities. To validate Mutstats, we evaluate its performance on simulated datasets as well as two breast carcinoma samples from The Cancer Genome Atlas project.


Introduction
Tumor cell population is known to be a mixture of heterogeneous cell subpopulations (Nowell, 1976;Marusyk and Polyak, 2010;Swanton, 2012;Yates and Campbell, 2012). Each subpopulation is characterized by its distinct genome and is referred to as a subclone. With the advancement of next generation sequencing (NGS) technology, whole-genome or whole-exome sequencing data have enabled researchers to study the genomic profile of tumor subclones in detail within the same patient or among different patients. In particular, data of sequence variations such as single nucleotide variants (SNVs) and structural variations such as copy number aberrations (CNAs) are widely used for subclone inference.
For a heterogeneous tumor sample with multiple subclones, understanding of its mutational profile, such as the clonal status of the mutations, is crucial for accurate disease prognosis and precision medicine. A mutation is called clonal if it occurs across all the tumor cells. On the other hand, a mutation is called subclonal if it only occurs in a subset of tumor cells. Figure 1(a) provides a stylized illustration of tumor heterogeneity and clonal and subclonal mutations. The "G" mutation at the first locus and the "C" mutation at the third locus are only possessed by a fraction of tumor cells and are subclonal, while the "T" mutation at the second locus is possessed by all tumor cells thus is clonal. Inference on clonal status can shed light on the timing of mutational processes and thus provides important information for personalized treatment Figure 1: (a) Illustration of tumor heterogeneity, clonal and subclonal mutations, and copy number aberrations. In this example, we have 3 genomic loci that we record mutations, and a total of 3 tumor subclones. (b) Representation of tumor subclones using L, Z and w.
strategy. For example, for patients with chronic lymphocytic leukemia, the presence of subclonal driver mutations leads to more rapid disease progression (Landau et al., 2013). The acquisition of a drug resistance allele in a fraction of tumor cells can result in clinical drug resistance and thus can reduce the efficacy of cancer therapies (Schmitt et al., 2016). For patients with colorectal cancers, the acquisition of subclonal KRAS mutation leads to resistance to cetuximab (Misale et al., 2012). Intuitively, clonal mutations generally occur earlier in time compared to subclonal mutations. A greater burden of subclonal mutations indicates a greater degree of intra-tumor heterogeneity which could be associated with worse outcome for patients with various cancer types.
There is an extensive literature on tumor heterogeneity and subclone inference. See, for example, Beerenwinkel et al. (2014) for a review. Different methods study tumor heterogeneity from different perspectives. For example, McGranahan et al. (2015) focus on deciphering the clonal status of mutations. Roth et al. (2014) propose the PyClone method, which represents subclones as clusters of SNVs and infers SNV clusters and their cellular prevalence. Deshwar et al. (2015) further impose phylogenetic constraint on SNV clusters and infer subclone phylogeny. Lastly, Sengupta et al. (2015), Lee et al. (2016), Zhou et al. (2019) and Zhou et al. (2020) directly model subclonal genomes which are characterized by overlapping sets of SNVs.
Our work complements existing methods in several aspects. First, a number of terminologies are presented in the existing methods, defined under different characterizations of tumor heterogeneity. Some terminologies have been defined conceptually rather than quantitatively. This could potentially lead to confusions and makes it hard linking one term with another. We discuss a unified mathematical representation of subclones and define the related terminologies under the unified representation in a rigorous way. Second, many existing methods, such as Sengupta et al. (2015), are based on direct inference about subclonal genomes. Direct inference is promising, because it fully characterizes the genetic structure of the subclones and implies all other quantities of interest. However, such inference is very challenging and has identifiability issue. Moreover, direct inference is computationally expensive and finds it hard to analyze thousands of mutations. Therefore, in this work, we choose to focus on identifying the clonal status of somatic mutations to achieve computational feasibility. Lastly, we recognize that validation is crucial to the development of inferential algorithms. We develop a simulation algorithm to generate hypothetical sequencing data that are akin to real data. We conduct extensive simulation studies to assess the performance and speed of our proposed method. Simulated datasets with clearly labeled clonal status could greatly facilitate the development of novel inferential algorithms, both for sanity check and for comparison to alternative methods in the field.
The remainder of the paper is organized as follows. In Section 2, we give a quantitative representation of subclones and define commonly used terminologies under the unified framework. In Section 3, we present the Mutstats method of designating clonal status of mutations. In Section 4, we propose a general algorithm to simulate read counts data and evaluate the performance of Mutstats through simulation studies. In Section 5, we analyse two samples of a breast cancer patient from The Cancer Genome Atlas (TCGA) dataset. Section 6 concludes the paper with a discussion.

Representation of Subclones
We follow the notations from the direct inference literature (e.g. Lee et al. 2016) to give a quantitative representation of subclones. Consider one tumor tissue sample that is dissected from either primary or metastatic sites. Typically, tumor samples contain normal cells to a certain extent. The fraction of tumor cells in the entire cell population is called tumor purity and is denoted by μ. For example, in Figure 1(a), the hypothetical tumor sample contains 30% normal cells (6 grey cells) and 70% tumor cells (14 cells of color blue, pink, and orange), leading to a tumor purity of μ = 0.7. This mixture could be thought of as the first level of tumor heterogeneity. By definition, the normal cells do not possess any somatic mutation and have copy number 2 at any genomic locus as we restrict our attention to autosome.
Let C denote the number of subclones, which also needs to be estimated in practice. For example, in Figure 1(a), we have C = 3 subclones. We first construct a S ×C dimensional integervalued matrix L to characterize subclonal copy numbers, where S is the number of genomic loci that we record SNVs. The (s, c)-entry of L, l sc , represents the total copy number of subclone c at locus s. An example of the L matrix is shown in Figure 1(b). Since CNAs occur in segments of genomic regions, we assume the S loci fall into K copy number segments (CNSs), with k(s) index the CNS of locus s. The loci in the same CNS have the same copy number status. Let l * kc denote the total copy number of CNS k. For two loci s 1 and s 2 in the same CNS k, i.e. k(s 1 ) = k(s 2 ) = k, we have l s 1 c = l s 2 c = l * kc . Next, we introduce a S × C dimensional integer-valued matrix Z to record subclonal SNVs. The (s, c)-entry of Z, z sc , represents the number of variant alleles at locus s of subclone c. We always have z sc l sc . Again, an example of the Z matrix is shown in Figure 1 Finally, we use a C dimensional vector w to represent the population frequencies of the subclones. The c-th element of w, w c , represents the population frequency of subclone c. We have C c=1 w c = μ. Denote byw c = w c /μ the proportion of subclone c in all tumor cells; we have C c=1w c = 1. Examples of the w andw vectors are provided in Figure 1(b). The quantities L, Z, w and μ, which are illustrated in Figure 1, provide a unified representation to fully describe the genetic structure of the subclones, and all other quantities of interest can be derived from these four. For example, SNV s is clonal if z sc 1 for all c, and SNV s is subclonal if z sc = 0 for some c. In the literature, some other terminologies are used to characterize tumor heterogeneity. Nevertheless, these terminologies can be represented by μ, L, Z and w. Table 1 summarizes some commonly used terminologies, and we will refer to some in later discussions.

Data Preparation
Denote the total number of reads and the number of reads with variant sequences as N s and n s for SNV locus s, respectively, the read counts are the only data that are directly observed. The Mutstats method starts with tumor and matched normal bam (Barnett et al., 2011) files which are generated by mapping the raw short reads from fastq (Cock et al., 2010) files to the appropriate reference genome. Using tumor and normal bam files, somatic mutation calling tools find out all the loci on chromosomes that bear SNVs. Commonly used tools include, for example, Varscan2 (Koboldt et al., 2012), MuTect (Cibulskis et al., 2013) or Muse (Fan et al., 2016).
Following existing subclone reconstruction methods, we obtain estimates of tumor purity (μ) and copy numbers in copy-number segments (l kc ) using existing purity caller and copy number caller, such as Battenberg (Nik-Zainal et al., 2012), ABSOLUTE (Carter et al., 2012) or FACETS (Shen and Seshan, 2016). In this work, we use the Battenberg caller to obtain estimates of tumor purity and copy numbers. In the following discussion, we will focus on determining the clonal status of the SNVs.
Next, we estimate mutation multiplicity for each SNV. Mutation multiplicity is defined as the average allele copies in cancer cells carrying a mutation. Denote by m s the multiplicity of SNV s. Using the notation in Section 2, m s = C c=1w c z sc . For example, in Figure 1(a), m s = 0.29, 1.29 and 0.14 for s = 1, 2 and 3, respectively. A point estimate of m s based on the read counts is computed bym wherel s is the average copy number of SNV s,l s = μ · C c=1w c l sc + (1 − μ) · 2. Recall that μ andl s are estimated by upstream bioinformatics tools. The point estimatem s is unbiased with the understanding that E(n s /N s ) = C c=1 w c z sc /l s , where the right hand side of the equation is referred to as the variant allele fraction (VAF). To distinguish between the mutation multiplicity and its point estimate, we refer to m s andm s as expected mutation multiplicity and observed mutation multiplicity, respectively.
We note that, instead of mutation multiplicity, many methods (e.g. Roth et al. 2014) use cancer cell fraction (CCF) to infer clonal status. CCF is defined as the fraction of cancer cells carrying a mutation, C c=1w c 1(z sc > 0). For example, in Figure 1(a), the CCF values for mutations 1, 2 and 3 are 0.29, 1 and 0.14, respectively. By definition, CCF should be ∈ [0, 1], i.e., not exceed 1. However, due to mapping or sequencing errors, estimates of CCF could become greater than one, which is difficult to interpret biologically since fractions cannot be greater than 1. To remedy this issue, in some methods, in practice CCFs are often artificially truncated at 1. Apparently, these artificial truncations are ad-hoc and not desirable. In contrast, multiplicity does not have a theoretical upper bound and can take any non-negative real values. Therefore, we choose to use mutation multiplicity rather than CCF to infer clonal status.

Model for Mutation Multiplicities
We model the observed mutation multiplicity for SNV s with a Gaussian distribution centered at the expected mutation multiplicity, The SNVs co-occurred in the same set of subclones (i.e. the SNVs having the same z sc 's) should have the same expected mutation multiplicities. Therefore, we assume (m s , τ s ) can only take R possible values, with Pr (m s , τ s ) = (u r , σ r ) | u r , σ r , π r = π r , for r = 1, 2, . . . , R, where R is unknown a priori and needs to be estimated. Denote by θ = (π 1 , . . . , π R , u 1 , . . . , u R , σ 1 , . . . , σ R ). Integrating out (m s , τ s ), the marginal distribution ofm s is a finite mixture of Gaussian distributions (Melnykov et al., 2010), where π r represents the weight of mixture component r, and φ(· | u r , σ r ) denotes a Gaussian density with mean u r and standard deviation σ r . We use the expectation-maximization (EM) algorithm (Dempster et al., 1977) to obtain the maximum likelihood estimate (MLE) of the parameters θ . Let ξ sr be a cluster membership indicator such that ξ sr = 1 (or 0) represents observation s belongs to (or does not belong to) cluster r, respectively. Denote by e sr the conditional probability that observation s belongs to cluster r given the parameters, .
For a fixed R, the EM algorithm starts from random initial values of θ , and then iterates between an E-step and an M-step until convergence. In the E-step, we compute e sr for all s = 1, . . . , S and r = 1, . . . , R given the current values of the parameters θ . In the M-step, we compute the MLE of θ given e sr 's. It can be shown that the parameters converge to the MLE of the Gaussian mixture model (Equation 2).
To select an optimal number of mixture components R, we run the EM algorithm with different R = 1, . . . , 7. We then choose the optimal R based on the Bayesian information criterion (BIC, Schwarz et al. 1978). The R package mclust (Fraley et al., 2016;Scrucca et al., 2016) is used for implementation.

Determining Clonal Status
After parameter estimates of the Gaussian mixture model are obtained, we use the following procedure to determine the clonal status of an SNV.
Recall that m s = C c=1w c z sc . If m s < 1, at least one z sc = 0 for c = 1, . . . , C, which suggests SNV s is subclonal. On the other hand, m s 1 does not necessarily mean SNV s is clonal, but can be due to, for example, high copy number. Therefore, additional copy number information is needed to determine the clonal status of SNV s when m s 1. Let wherem rep s is a hypothetical replication of the observed mutation multiplicity for SNV s. A larger λ s means a larger probability ofm rep s < 1 thus indicates a higher chance of m s < 1, and vice versa. Therefore, λ s can be used as a proxy to determine whether m s < 1. The advantage of using λ s is that it takes into account the uncertainty associated with the point estimates.
The probability λ s = Pr(m rep s < 1 |m s , θ ) is calculated as follows, where (· | u r , σ r ) denotes the cumulative distribution function of a Gaussian distribution with mean u r and standard deviation σ r . If λ s is large, say, λ s > H 1 for a certain threshold H 1 , we think m s < 1 thus determine SNV s is subclonal. Otherwise, if λ s H 1 , we think m s 1 and use additional copy number information to determine the clonal status of SNV s. Suppose SNV s resides inside CNS k. Using the notation introduced in Section 2, let l * kc1 and l * kc2 denote the major and minor allele-specific copy numbers (ASCNs) of CNS k in subclone c. Here, the major and minor alleles represent the most common and less common alleles in the population, and the ASCN of an allele refers to the number of copies of that allele. Note that the reference/alternative allele is different from the major/minor allele: the former is defined with respect to an individual patient, while the latter is based on the population. The following relationship is assumed: z sc max(l * kc1 , l * kc2 ), meaning that the copy number of the alternative allele should be less than or equal to the maximum ASCN. The Battenberg caller provides aggregated information about ASCNs, and we utilize such information to determine the clonal status of SNV s. The following two scenarios are possible. 1. If Battenberg determines that CNS k has one copy number state, it outputs a pair of ASCNs (q k1 , q k2 ). Essentially, Battenberg thinks that l * kc1 = q k1 and l * kc2 = q k2 for all c. Denote bỹ q k = max(q k1 , q k2 ). If m s >q k , we identify s as clonal, since a subclonal SNV cannot produce a mutation multiplicity greater than w c max(l * kc1 , l * kc2 ) =q k . To determine whether m s >q k , we calculate and if this is greater than some threshold H 2 , we think m s >q k and classify s as clonal.
Otherwise, we let s be unclassified at this time. 2. If Battenberg determines that CNS k has two copy number states, it outputs two pairs of ASCNs, (q 1 k1 , q 1 k2 ) and (q 2 k1 , q 2 k2 ), as well as their population frequencies in the tumor cells, ρ 1 and ρ 2 . Implicit in the two sets of ASCNs is a partition of the C subclones, {1, . . . , 2 k , s should be clonal. Again, this is because a subclonal SNV cannot produce a mutation multiplicity greater than w c max(l * kc1 , l * kc2 ) = ρ 1q and if this is greater than H 2 , we think m s > ρ 1q 1 k + ρ 2q 2 k and classify s as clonal. Otherwise, we temporarily assign unclassified status to s. For simplicity, we use the same threshold H 2 for both scenarios, but it is possible to consider different thresholds. The values of H 1 and H 2 can be specified through simulation. Specifically, in the simulation studies to be presented in Section 4, we first simulate 10 tumor samples, for which the true clonal status of the mutations are known. Next, we run Mutstats on the simulated data with a grid of values of (H 1 , H 2 ). Lastly, we find the optimal threshold values that lead to the highest classification accuracy on the 10 simulated samples. The value of H 1 can also be chosen based on a desirable maximum false discovery rate (MFDR) threshold, denoted by f 0 . Let λ (k) denote the k-th largest value of {λ 1 , λ 2 , . . . , λ S }, and let where |A| denotes the cardinality of the set A. We may set H 1 = λ s * . This error rate is denoted as MFDR because 1 − λ s = P r(m rep s 1|m rep s , θ) does not necessarily imply that the SNV is clonal, due to the existance of unknown status. Therefore, it is an overestimation of the true false discovery rate. In the real data analysis (Section 5), the threshold H 1 is selected using this approach (while H 2 is still chosen based on simulation). The procedure described above leaves some SNVs unclassified. We further use linear discriminant analysis (LDA) to find their clonal status, treating the already-classified SNVs as training data. For SNV s, let x s = (N s , n s ,m s , λ s ), and let y s ∈ {0, 1} denote its clonal status (0 for subclonal and 1 for clonal). Recall that N s and n s are the total number of reads and number of reads with variant sequences mapped to the location of SNV s, respectively,m s is the observed mutation multiplicity, and λ s is defined in Equation (3). We use x s as covariates to predict y s . The LDA method assumes that the conditional distributions for [x s |y s = 0] and [x s |y s = 1] are normal with parameters (μ 0 , ) and (μ 1 , ), respectively. The predicted clonal status for an unclassified SNV s is where κ i = # of SNVs with clonal status i Total # of SNVs is the relative frequency of clonal status i, and φ(x | μ i , ) denotes a multivariate Gaussian density with mean μ i and covariance matrix . The R package lda is used to implement the LDA algorithm. With the additional LDA step, the proposed Mutstats algorithm is able to determine the clonal status for all SNVs. The flow of the entire Mutstats algorithm is given in Figure 2.

Brief Review of Data Simulation Approaches
Despite the increasing effort in generating and providing access simulation approaches that may generate and mimic real-world NGS data could be extremely helpful for research and method development. Numerous simulators have been developed for simulating DNA sequencing data for various applications. For instance, Huang et al. (2012) developed the ART simulator that generate synthetic NGS reads, Shcherbina (2014) constructed the FASTQSim simulator that provides dual functionality of NGS dataset characterization and metagenomic data generation, Qin et al. (2015) proposed the SCNVSim tool for simulating somatic CNVs and structure variations SVs, and Xia et al. (2017) developed the Pysim-sv that simulates HTS data to evaluate performance of structural variation detection algorithm. And recently, Yu et al. (2020) developed the SimuS-CoP tool to emulate complex DNA sequencing data. While these methods simulate the lower level data such as reads data, we proposed a general simulation algorithm to directly generate read counts data, which could be directly used by researchers for developing novel methods that take read counts data as input.
Our proposed simulation scheme is anchored by the theory of tumor cell evaluation based on clonal and subclonal somatic mutations (Figure 1). We 1) generate the proportion of the tumor cells and normal cells, 2) generate the allele specific copy numbers that could be clonal or subclonal, and 3) simulate the number of somatic mutations at each loci given in step 2. These are well-known cancer cell genomics that have been explained in Nik-Zainal et al. (2012) and the more recent ICGC landmark publication .
In this work, we assess Mutstats via simulation studies. The detailed data simulation scheme consists of the following steps. 1. Generate tumor purity μ, number of subclones C, number of loci S and number of CNSs K from uniform and discrete uniform distributions, Generate population frequencies of the subclonesw from a Dirichlet distribution, w ∼ Dir(a w , a w , . . . , a w ).
Here, μ min , μ max , . . . , K min , K max are user specified lower and upper bounds for the simulation parameters, and a w is a user specified hyperparameter. 2. For each CNS k ∈ {1, 2, . . . , K}, let α k be an indicator of its clonal status. If C = 1, α k = 1; otherwise, generate α k from a Bernoulli distribution, where p α is a user specified hyperparameter. According to the value of α k , there are two possibilities. (a) If α k = 1, CNS k is clonal. Draw common ASCNs (q k1 , q k2 ) for all subclones from discrete uniform distributions, where q max is a user specified maximum copy number. Set (l sc1 , l sc2 ) = (q k1 , q k2 ) for all loci in CNS k and all subclones, i.e. for all s ∈ {s : k(s) = k} and c ∈ {1, 2, . . . , C}. (b) Otherwise, if α k = 0, CNS k is subclonal. Draw two sets of ASCNs (q 1 k1 , q 1 k2 ) and (q 2 k1 , q 2 k2 ) from discrete uniform distributions, For each subclone c ∈ {1, 2, . . . , C}, set (l sc1 , l sc2 ) = (q 1 k1 , q 1 k2 ) or (q 2 k1 , q 2 k2 ) with equal probability for all loci in CNS k. 3. For each SNV s ∈ {1, 2, . . . , S}, let β s be an indicator of its clonal status. If C = 1, β s = 1; otherwise, generate β s from a Bernoulli distribution, According 4. Let φ denote the expected total number of reads for each locus assuming the locus has no CNA. Generate φ from a Gamma distribution, Then, for each SNV s, generate the total number of reads from a Poisson distribution, N s ∼ Poi(φl s /2), wherel s = μ · C c=1w c l sc + (1 − μ) · 2 is the average copy number of SNV s. Next, for each SNV s, calculate its VAF by Finally, generate the number of reads with variant sequences from a binomial distribution, n s ∼ Bin(N s , v s ).
Following the steps above, the simulation algorithm generates tumor purity μ, subclonal copy number matrix L, variant allele count matrix Z and subclonal population frequency vector w, as well as total number of reads N s and number of reads with variant sequences n s for each SNV s. A web-based tool for simulating and visualizing read counts data can be found at https://compgenome.shinyapps.io/tumorsim. Figure 3 shows the plots of N against n/N for six simulated tumors with different purities and numbers of subclones, as well as three samples from real data. It is clear that the simulated data highly resembles that of the real data.

Performance on Simulated data
Using the simulator proposed in Section 4.1, we simulate 1000 samples of various choices of purity, number of subclones, number of SNVs and copy number profiles. Furthermore, we excluded the simulated samples with no subclonal SNVs, leaving 868 samples. To select the threshold values H 1 and H 2 (see Section 3.3), we randomly generate 10 additional tumor samples. We define a  Table 2. We use three metrics to evaluate the performance of the proposed method: classification accuracy (or accuracy), sensitivity, and specificity, defined as Here, S is the total number of SNVs in a simulated sample, y true s refers to the true clonal status of SNV s, and y s is the clonal status of SNV s determined by Mutstats. From Table 2, Mutstats performs well on the simulated data with a high accuracy (around 80%), a high sensitivity (more than 85%), and a good specificity (around 70%).

Sensitivity Analysis of H 1 and H 2
Since the status of the outcome are determined based on the threshold values H 1 and H 2 , they are important components of our proposed algorithm and we are performing some simulation studies with various values of H 1 and H 2 to investigate their effects. To investigate the effect of H 1 , we hold H 2 constant at 0.16, and vary H 1 from 0.1 to 0.9. The result is shown in Figure 4a. We observed that while holding H 2 constant, an increase in H 1 results in an increase in both accuracy and sensitivity (fraction of correct subclonal calls among the true subclonal SNVs) but a decrease in specificity (fraction of correct clonal calls among the true clonal SNVs). The reason for increasing sensitivity and accuracy in Figure 4a is due to the fact that higher H 1 values give fewer but more confident subclonal calls, while holding H 2 constant. When H 1 increases to an extreme large value (say 0.99), sensitivity eventually starts to drop due to too few subclonal calls although they are accurate. Specificity is slightly decreased due to more SNVs are not classified as subclonal and sent to the next step for clonal calls. On the other hand, in 4b we fix H 1 = 0.86 and increase the value of H 2 's. The pattern shows that both sensitivity and accuracy decreases when H 2 is over 0.5. This is because when H 1 is fixed, the number of subclonal SNVs called by our algorithm is fixed and increasing H 2 will result in fewer and more accurate clonal calls. When H 2 is large enough, the reduction in the number of clonal calls eventually causes specificity to drop and also leads to decreasing accuracy.
So to summarize, to achieve a balanced performance, we suggest a reasonably large fraction for H 1 (around 0.9) and a reasonably small fraction for H 2 (around 0.2). In our analysis, we used H 1 = 0.86 and H 1 = 0.16.

Comparison with Existing Methods
For comparison, we run two existing tools, PyClone (Roth et al., 2014) and the method proposed by McGranahan et al. (2015) (which we denote by NMCS, from the first author's initials), on a randomly selected set of 100 samples out of the 868 simulated samples.
PyClone does not directly infer the clonal status of mutations. Instead, it clusters SNVs based on their cancer cell fractions (CCFs), and some post-processing steps are necessary to identify the clonal status of SNVs. Recall that the CCF of a SNV is a value between 0 and 1 and represents the fraction of cancer cells carrying it. The SNVs belonging to the same cluster are thought of having the same CCF value. By definition, SNVs with CCF values close to 1 are classified as clonal, while SNVs with CCF values much lower than 1 are classified as subclonal. As PyClone outputs each SNV with its cluster membership as well as the posterior mean of its CCF value, we use the following three different post-processing procedures to determine the clonal/subclonal status for each cluster of SNVs based on the posterior CCF value, using a probability threshold of 0.9. 1. PyClone Median: For an SNV cluster with multiple SNVs, we obtain the median of the posterior means of their CCF values. Based on whether or not this median value is greater than 0.9, we assign clonal or subclonal status to all SNVs of the cluster, respectively. For a cluster with only one SNV, we obtain the posterior median of its CCF value and compare this value to 0.9 to determine its clonal status. 2. PyClone 75th-tile: For an SNV cluster with multiple SNVs, we obtain the 75th percentile of the posterior means of their CCF values. For a cluster with only one SNV, we obtain the 75th percentile of the posterior draws of its CCF value. We compare these values with 0.9 to determine the clonal status of the SNVs. 3. PyClone 95th-tile: For an SNV cluster with multiple SNVs, we obtain the 95th percentile of the posterior means of their CCF values. For a cluster with only one SNV, we obtain the 95th percentile of the posterior draws of its CCF value. We compare these values with 0.9 to determine the clonal status of the SNVs. NMCS, on the other hand, determines the clonal status of a mutation based on the confidence interval of its CCF value. If the 95% CCF confidence interval of a mutation overlaps 1, the mutation is classified as clonal; otherwise, the mutation is classified as subclonal. Figure 5 summarizes the accuracy, sensitivity and specificity of all methods on the 100 simulated tumor samples. We find that PyClone performs well in terms of accuracy and sensitivity, while NMCS is better in terms of specificity. Mutstats, on the other hand, achieves good performance on all three metrics. Specifically, the accuracy and sensitivity of Mutstats are comparable to those of PyClone, and the specificity of Mutstats is clearly higher than that of PyClone. Compared to NMCS, although Mutstats has a somewhat lower specificity, it has significantly higher accuracy and sensitivity. Overall, Mutstats is able to combine the strength of PyClone and NMCS.
Furthermore, Mutstat is an ultra-fast method. Table 3 shows the average running time of the three methods. The running time of Mutstats is similar to that of NMCS (in seconds) and on average can be 9000 times faster than PyClone (which runs in hours). Moreover, Figure 6 shows the ratio of classification accuracy to running time for each method and each simulated sample. The higher the ratio, the better the method in terms of balancing between classification accuracy and computation time. On average, Mutstats has the highest accuracy-to-time ratio, and on 92 out of 100 samples, Mutstats outperforms PyClone and NMCS.

TCGA BRCA Data Analysis
To demonstrate the practical usage of our method, we apply Mutstats to the analysis of two breast invasive carcinoma (BRCA) samples from The Cancer Genome Atlas (TCGA) project. The two samples are from the same TCGA donor (A15E), with one sample dissected from the  We run Mustats, PyClone, and NMCS on both samples to determine the clonal status of the SNVs. The threshold value H 1 is chosen based on a FDR threshold of f 0 = 0.05 (see Equation 7). For H 2 , we keep the same value of H 2 = 0.16 as in the simulation studies. In addition, due to the large number of SNVs in the data, we randomly selected 2,000 SNVs from both the primary and the metastatic samples to reduce the computation burden of the PyClone algorithm. The results are shown in Table 4 below. All three methods yield similar distributions of SNV classifications for the Metastatic sample. For the primary sample, Mutstats is also similar to the PyClone method. Furthermore, for the proposed Mutstats method, among the 5,534 shared mutations, 1,834 are identified as having the same clonal status in both samples with 1,826 subclonal and 8 clonal mutations. On the other hand, 3,700 shared mutations are classified differently in the primary and metastatic samples. In particular, 3,696 SNVs are classified as clonal in primary but subclonal in metastatic, while 4 are classified as subclonal in primary but clonal in metastatic. Lastly, among the unique mutations, Mutstats identifies that 4,991 are subclonal and 4,962 are clonal in the primary sample, and 12,446 are subclonal and 35 are clonal in the metastatic sample. In summary, compared to the primary tumor, the metastatic tumor has a much larger portion of subclonal mutations and a greater degree of intra-tumor heterogeneity. This finding suggests that the tumor progression is worrisome.

Discussion
We have made three major contributions in this paper. First, we have clarified several terminologies used by the community and have provided their definitions from a unified model (Table 1). Second, we have developed a simulator from that unified model to generate realistic tumor data with multiple subclones and different purities. Finally, we have presented an ultra-fast method to distinguish between clonal and subclonal mutation given read count data and output from a copy number caller. With the help of the tumor data simulator, we have evaluated our method on 1000 synthetic tumors. We have also run our algorithm on two real samples from the same TCGA patient.
In our method, the two threshold values H 1 and H 2 expressed the confidence on classifying the current SNV as subclonal and clonal, respectively. Through sensitivity analysis, we found that a reasonably large H 1 around 0.9 and reasonably small H 2 around 0.2 provide a good tradeoff of sensitivity and specificity.
The speed of Mutstats is the main attractive feature for practical applications. For example, compared with the popular PyClone method, Mutstats achieves over 9,000 fold reduction in computing time in the comparative study in §4.4. The main reason is that Mutstats avoids lengthy MCMC computation that is required by the PyClone method.
The proposed Mutstats methods can be modified and extended in several ways. First, we have utilized ASCN information from the Battenberg caller to determine the clonal status of a mutation. Although empirically we find that ASCN information improves our classification accuracy, many other copy number callers do not provide such information. In the absence of ASCN information, we may replaceq k and ρ 1q 1 k + ρ 2q 2 k in Equations (5) and (6) by the average copy numberl s . Intuitively, ifl s is large, then m s should also be large for SNV s to be clonal, which can be determined by whether Pr(m rep s >l s |m s , θ ) is greater than some threshold. Another interesting future direction is to develop alternative machine learning algorithms for clonal status classification. The proposed tumor simulator can generate a large number of tumor samples that resemble real-world datasets. The simulated tumor samples have labeled clonal status (ground truth) thus can be used as training data to train a machine learning algorithm for classification of the mutations in a real tumor sample.
Understanding the clonal status of mutations is very important for understanding the overall evolution process of a tumor. Such knowledge is essential for understanding of timing of mutations. By finding clonal status for actionable driver mutations, we hope to improve future drug design and strategies for advanced treatment in cancer.