Dynamic Classification of Plasmodium vivax Malaria Recurrence: An Application of Classifying Unknown Cause of Failure in Competing Risks

A standard competing risks set-up requires both time to event and cause of failure to be fully observable for all subjects. However, in application, the cause of failure may not always be observable, thus impeding the risk assessment. In some extreme cases, none of the causes of failure is observable. In the case of a recurrent episode of Plasmodium vivax malaria following treatment, the patient may have suffered a relapse from a previous infection or acquired a new infection from a mosquito bite. In this case, the time to relapse cannot be modeled when a competing risk, a new infection, is present. The efficacy of a treatment for preventing relapse from a previous infection may be underestimated when the true cause of infection cannot be classified. In this paper, we developed a novel method for classifying the latent cause of failure under a competing risks set-up, which uses not only time to event information but also transition likelihoods between covariates at the baseline and at the time of event occurrence. Our classifier shows superior performance under various scenarios in simulation experiments. The method was applied to Plasmodium vivax infection data to classify recurrent infections of malaria.


Plasmodium vivax Malaria Infection
Plasmodium vivax, in short, P. vivax, is the most widespread human malaria (Howes et al., 2016). According to the 2019 World Malaria Report released by World Health Organization (WHO), 53% of the global P. vivax burden is in the South-East Asia Region, and 75% of malaria cases in the Region of the Americas are resulted from P. vivax. Due to the dormant liver stage of P. vivax, hypnozoites may reactivate and cause another infection weeks to months after the initial infection (Chu and White, 2016). Relapse due to inadequately treated blood stages is less common and is referred to as treatment failure or recrudescence. Therefore, when first-line anti-malarials are used, relapse is usually attributed to hypnozoiteinduced relapse. P. vivax relapses are an important source of morbidity and contribute to malaria mortality (Dini et al. 2020, Robinson et al. 2015, Baird 2013. However, the fact that individuals can also become reinfected due to a new mosquito bite makes it difficult to study the anti-relapse efficacy of treatment. Previous studies have concluded that even when the level of transmission is relatively low, there is a high genetic diversity in P. vivax parasites within patient populations in Cambodia (Lin et al., 2013, Friedrich et al., 2016. Such genetic diversity, often resulting in multiple parasites haplotypes present in a single infection, provides an opportunity for researchers to distinguish relapse from a recurrent infection by examining the overlap of haplotypes between infections and the appearance of haplotypes associated with relapse. Lin et al. (2015) applied targeted deep sequencing to 108 isolates collected from 78 Cambodian volunteers with P. vivax infection (Lon et al., 2014). Subjects in the study were treated initially with dihydroartemisinin-piperaquine (DP), an effective drug to treat the blood stages of P. vivax, all but precluding treatment failure due to recrudescence. To detect recurrent infection, blood smears of study subjects were taken firstly at baseline, then weekly for six weeks following treatment, then monthly thereafter. At the end of the study, 23 of the 78 subjects experienced recurrent infections, with a median of 68 days in the time to recurrence. Subjects' participation in the study ranged from 2 to 6 months, with a median of 4 months follow-up. Since treatment failure with DP is unlikely, these recurrences most likely represent relapse or reinfection. In fact, of the 23 subjects with recurrent infection, five subjects had a second recurrent infection, and one subject had a third recurrent infection. To simplify the analysis, we only consider the first recurrent infection among those 23 subjects. Figure 1 shows the Kaplan-Meier curve for the first recurrent infection along with the risk table showing the number of subjects at risk over tenday intervals. The horizontal axis in the plot indicates days from baseline, and the vertical axis is the estimated survival probability. The solid line is the step function and shaded area is associated 95% point-wise confidence interval of the step function. The longest follow-up time is 180 days, and 70% (55 subjects) were disease-free at the end of the follow-up period. A subject-by-subject time to first infection plot is given in the Supplementary Materials. the 42-kDa region, which exhibits great nucleotide diversity. After extracting DNA from filter paper blood spots, Lin et al. (2015) applied deep sequencing to this region and used a bioinformatics pipeline called SeekDeep (Hathaway et al., 2018) to determine different haplotypes of pvmsp1 defined by at least a single nucleotide difference between haplotypes. They identified 67 unique pvmsp1 haplotypes across 108 isolates from either initial infection or recurrent infections, with each patient isolate harboring, on average, three different haplotypes. They found nine haplotypes that are common and appeared in at least 10% of individuals. 46 rare haplotypes appeared in only one isolate, with some later attributed to sequencing error. Only 41 unique haplotypes were identified in those subjects with recurrent infection. Figure 2 shows a heatmap that indicates the presence/absence of these 41 haplotypes (genetic variants) in the initial and recurrent infections from those 23 subjects. Each column represents one unique haplotype, and each row represents one subject with an identification number. The subjects were sorted based on their time to the first recurrent infection, with the shortest time at the top and the longest time at the bottom. Pink cells indicate the presence of the haplotype in the initial infection but absence in the recurrent infection. Blue cells show the absence of the haplotype in the initial infection but presence in the recurrent infection. Purple cells show haplotypes that were present in both infections. Interestingly, only 16 subjects had overlapping haplotypes between initial and recurrent infections. Two subjects with the shortest time to recurrent infection did not have any shared haplotypes.

Competing Risks with Unknown Cause of Failure
It is commonly seen in biomedical research that the occurrence of an event during the follow-up period can be attributed to one of multiple causes. Data of this type is a standard competing risks set-up, where one event occurs per subject, and the failure type is one of many possible causes. Usually, both time to event and the cause of failure are observable. However, in some cases, the cause of failure may be unknown or missing. For example, in P. vivax malaria research, subjects who live in endemic areas suffer recurrent infections which can arise from (1) mosquito bites representing new infection, (2) relapse from latent infection in the liver, or (3) recrudescence due to treatment failure. The cause of recurrent infection is unknown or indeterminable in this case, thus impeding the efficacy assessment of anti-relapse treatment. Developing a reliable method to distinguish new infections from relapse is critical.
The problem of missing cause of failure in competing risks data has been given much attention since Dinse (1982). There are two possible approaches for estimating competing risks data with missing cause of failure when the cause is missing at random (Rubin, 1976): (1) complete-case analysis, utilizing only complete observations, e.g., Effraimidis and Dahl (2014), or, (2) construct a regression model for the missing cause using all observations, including those with missing cause of failure. In the second approach, one can use a global parametric model (Lu and Tsiatis, 2001), a semi-parametric framework (Goetghebeur and Ryan, 1995) or a nonparametric regression method (Gouskova et al., 2017) to estimate the cause-specific hazard functions. A similar problem is also considered in Sun and Gilbert (2012) and Juraska and Gilbert (2016) when considering the competing cause as a mark for the mark-specific hazard function. A doubly robust estimator is proposed in these papers when the mark variable is possibly missing. However, these approaches require at least some of the observations to have complete records. They cannot be applied to the problem in P. vivax malaria research, where the cause of failure is unknown for every subject.
When analyzing the causes of P. vivax malaria recurrence from a competing risks perspective, it is natural to assume that the time to recurrent infection is associated with baseline covariates (e.g., genetic variants or haplotypes) collected at the initial infection. We assume that each cause has a distinct cause-specific hazard function conditional on the baseline covariates, enabling us to build an initial cause classifier that can distinguish the cause based on the time to recurrence information. Subsequently, by observing changes in the values of genetic variants between initial and recurrent infections, one can build another classifier that can distinguish the cause of failure, as the changes are driven by the latent cause. Thus, one can update the initial classifier by utilizing the information contained in the transition of covariates between initial infection and recurrent infection. To study the transition mechanism, Lin et al. (2020) proposed an approach that estimates the transition likelihoods using both shared and non-shared genetic variants to improve classification accuracy when the cause of recurrent infection is unknown. Bureau et al. (2003) utilized a continuous-time hidden Markov chain to obtain the true transition probabilities between states when the disease status is possibly misclassified. However, Lin et al. (2020) did not consider the time to recurrent infection, and Bureau et al. (2003) required the disease status to be fully observed but subject to misclassification. Neither of these approaches is ideal for our malaria data, and can not be applied to the classification problem when dealing with competing risks data with missing cause of failure.
In the classification problem with unknown cause of malaria recurrence, Taylor et al. (2019) proposed a Bayesian approach that models the time to recurrent infection for prior classification probability and then computes the posterior probability based on an assumed genetic model with a strong prior assumption. Ferreira MU, de Sousa TN et al. (2020) treated relapse (combined with recrudescence) and new infection as competing risks assuming an exponential distribution with a time-constant hazard for both causes. In contrast, we analyze the time to event data under a competing risks set-up without specifying any temporal pattern of the hazard function. We generalize the idea in Lin et al. (2020) to incorporate the transition likelihoods between covariates to classify the unknown cause of infection. By considering the time to event information and transition likelihoods at the same time, we utilize more information from the data and thus lead to a more accurate classifier. Our method allows the causes of failure to be completely missing and can be applied to P. vivax malaria data (Lin et al., 2015). The classification procedure includes two main steps. First, we utilize the time to event and baseline covariates information to obtain an initial classifier. Then, we update the classification probability obtained in the first step using transition likelihoods between covariates to obtain the second classifier, whose performance is better than the first one. The challenges of building these classifiers are that the covariates are high-dimensional, and they can be of different kinds of variables. To resolve the first challenge, we propose a penalized maximum partial likelihood estimator and use an efficient proximal gradient descent algorithm to obtain the estimator. To resolve the second challenge, we propose a general transition likelihood that can incorporate different kinds of variables.
The rest of this paper is organized as follows. In Section 2, we describe the method of modeling competing risk data under a proportional hazards model with baseline covariates. In Section 3, we introduce general formulae for the two classifiers. An algorithm for the computation of parameters needed for constructing the classifiers is laid out in Section 4. We carry out comprehensive simulation experiments under various scenarios to evaluate the performance of the proposed classifiers in Section 5. Finally, we apply the developed method to the P. vivax malaria data and show the classification result in Section 6. We summarize our current approach and discuss its extensions in Section 7.

Model and Estimation
In a general setting of competing risks, let T i * be the failure time and ϵ i ∈ {1, 2} be the cause of failure for subject i. We consider only two causes of failure since this is the most general setting of competing risks application. If there are more than two causes, one may combine causes other than the primary interest into one category and format the model with two causes of failure. To model the time to failure when competing risks are presented, we consider a cause-specific hazard function for cause k, (k = 1, 2), defined by: λ ik (t) = lim dt 0 P t ⩽ T i * < t + dt, ϵ i = k | T i * ⩾ t /dt. With X i = (X i1 , …, X iJ )′ being the J-dimensional vector of covariates at the baseline, we consider a proportional hazards model for the cause-specific hazard function, defined by λ ik (t; β) = λ 0k (t)exp β k ′X i , where λ 0k (t) is the baseline hazard function for cause k, β k = (β i1 , …, β kJ )′ is the vector of regression coefficients, and β = β 1 ′ , β 2 ′ ′ (Kalbfleisch and Prentice, 2002, Section 8.2).
When the causes of failure are fully observed and time to failure is right-censored, one observes T i = min T i *, C i , δ i = I T i ⩽ C i , and failure type ϵ i when δ i = 1, where I(·) is the indicator function. Assume {T i , δ i , δ i ϵ i , X i } are i.i.d. for i = 1, …, n. Under the fully observed data, we estimate β using the partial likelihood function where δ ik = δ i I(ϵ i = k) indicates whether the failure of cause k occurs, and R i ≡ {l: T l ⩾ T i } is a set of subjects who are at risk at T i . However, in our case, neither cause was observed. Thus, the partial likelihood function above is not feasible since δ ik is not observable. When neither cause is observed, the available data is {T i , δ i , X i } for i = 1, …, n, which is identical to the conventional right-censoring time to event data. The partial likelihood function for β is where λ i (t) is the overall hazard function. Assuming only one event can occur at time t + dt, one writes the overall hazard function as λ i (t) = ∑ k = 1 2 λ ik (t) since P t ⩽ T i * < t + dt | T i * ⩾ t = ∑ k = 1 where the baseline hazard function λ 0k (t) cannot be completely unspecified for k = 1, 2, unlike the partial likelihood function in (1).
The primary interest of the competing risks model in our application is written as λ i2 (t) = λ 0 (t)exp β′X i .
This model fits naturally with the P. vivax malaria data we intend to analyze. Reinfection is considered as the first cause of failure (ϵ i = 1) that randomly occurs from the environment following a time-to-event distribution with no association with the baseline covariates X i . We assume its hazard λ i1 (t) can be written as the baseline hazard λ 0 (t) attenuated by a constant factor exp(α) as shown in model (3). The hazard function λ i1 (t) is considered as the background hazard. For the P. vivax malaria study, λ i1 (t) represents a random mosquito bite from the living or working environment. Relapse is considered the second cause of failure (ϵ i = 2) that is associated with the baseline covariates X i in model (4), which follows a proportional hazards model. These two causes of failure compete to occur, and only one of the causes, either relapse or reinfection, would occur if the event time is not censored. Under models (3) and (4), both hazard functions share the same baseline hazard λ 0 (t). The ratio of λ i1 (t) and λ i2 (t) only depends on baseline covariates X i , and can be considered as a semiparametric two-sample density ratio model promoted by Qin (1998). The baseline hazard λ 0 (t) here needs no specification, and can be any function of time. It can also be a function of covariates, under the condition that covariates included in λ 0 (t) are independent of those in X i .
Without any specification of λ 0 (t), one can use the partial likelihood function to estimate θ = (α, β′)′ where α and β are unknown parameters of interest. However, the dimensionality of θ is a concern in our case since genetic sequencing produces a large number of haplotypes that are considered as covariates in our model. In Section 4, we introduce a penalized maximum partial likelihood method to estimate the high-dimensional θ.
function for Λ i (t) = Λ 0 (t){exp(α) + exp(β′X i )}. The estimation involves not only parameter estimates for θ = (α, β′)′, but also baseline hazard estimate for Λ 0 (t) = ∫ 0 t λ 0 (s)ds. One can use a Breslow-type estimator Λ 0 (t) = ∑ i = 1 n I T i ⩽ t δ i / ∑ j ∈ R i exp(α) + exp β ′ X j for Λ 0 (t) and calculate a test statistic T (x) = ∑ i = 1 n I β ′ X i ⩽ x M i for a lack-of-fit test over the follow-up time. One can construct a confidence band for T(x) via Monte-Carlo simulation, as proposed in Lin et al. (1993). Model diagnosis results for the P. vivax malaria data are given in Section 6.

Classification
We propose two classifiers to classify the event to one of the two causes. The first classifier uses the baseline information and partial likelihood function (5) to obtain the initial estimate of the probability that the event is of cause k. The second classifier updates the first classifier using transition likelihoods under different causes. We expect that the second classifier will perform better when the transition of covariates is informative since more information is involved. If the transition of covariates is not informative of the cause of failure, the second classifier improves little from the first classifier.

Based on Baseline Information
Let N i *(t) be the number of events up to time t, and dN i *(t) = N i *(t + dt) − N i *(t) be the event indicator in the next instantaneous time dt after t. The observed counting process be the probability of cause k, given that an event occurs in [t, t + dt) and the realization of baseline covariate is X i = x i . We have: ξ ik (0) (t) = P ϵ i = k | d N i (t) = 1, X i = x i = λ ik (t; θ)/λ i (t; θ). If an event occurs at T i = t i for where θ is the maximum partial likelihood estimator of θ in (5). Since formulae (6) and (7) are independent of t i , we write ξ i1 We classify an event to be of cause 2 if ξ i2

Based on Both Baseline and Event Information
When an event occurs for subject i, we assume that Z i = (Z i1 , …, Z iJ )′ is collected at the event time, which is the same set of covariates as baseline covariates X i . We propose to utilize the transitions from X i to Z i to aid the cause classification. Let ξ ik (1) (t) = P ϵ i = k | dN i (t) = 1, X i = x i , Z i = z i be the probability of cause k given realizations is the conditional density function of Z i given X i under cause k. We call ϕ i (k) the conditional transition likelihood of cause k. One can treat the classification probability ξ ik (1) (t) as an updated version of ξ ik (0) (t) by the ratio of transition likelihoods between possible causes since for ℓ = 1, 2 and ℓ ≠ k. Note that if the transition likelihoods are informative, ϕ i (1) and ϕ i (2) will be very different from each other and thus lead to a more accurate classification of ξ ik (1) (t).
We assume that the transition likelihood ϕ i (k) follows a parametric model ϕ i (k, γ k ), where γ k is the vector of parameters to be estimated. More details of this parametric model ϕ i (k) follow in Section 3.3. The distribution of Z i is a mixture of transition likelihoods from two latent causes: With ξ ik (0) (t) being estimated by ξ ik (0) , and let m = ∑ i = 1 n δ i be the number of subjects having recurrent infections. We estimate γ k by maximizing a pseudo log-likelihood function: Let γ 1 ′ , γ 2 ′ ′ = argmax γ 1 , γ 2 ℓ γ 1 , γ 2 and write ξ ik (1) in short for ξ ik (1) t i . We estimate ξ ik (1) by We classify the event to be of cause 2 if and only if ξ i2 (1) > ξ i1 (1) .

Transition Likelihood
The transition likelihood plays a critical role in classification. In this section, we discuss a generalized linear model to model the transition likelihood function ϕ i (k, γ k ). Suppose the density of Z ij conditioning on X ij and ϵ i = k has the form of where a(·), b(·) and c(·) are known functions, ϑ ijk is the natural parameter, and ψ jk is the dispersion parameter (McCullagh and Nelder, 1989). Let g(μ ijk ) = ϑ ijk be the cause-specific canonical link function, where μ ijk = E(Z ij |ϵ i = k, dN i (t) = 1, X ij = x ij ). We define ϕ i (k, γ k ) as: where g(μ ijk ) = q jk0 + x ij q jk1 , q jk0 is the intercept term and q jk1 is the coefficient of x ij .
Our proposed transition likelihood model manifests differently according to the type of covariates. We give three examples showing how to construct ϕ i (k, γ k ) when the covariates are binary, normal, or Poisson.
Example 2 (Normal Covariates). When X ij and Z ij are normally distributed covariates, we have where the link function g is an identity function. The transitional likelihood for cause k becomes where γ k = q k0 ′ , q k1 ′ , q k *′ , ψ k ′ ′ and ψ k = (ψ 1k , …, ψ Jk )′.
To estimate θ in (5), we propose to solve a penalized partial likelihood problem where ν is a positive tuning parameter and p(β) is a penalty function. When sample size n is larger than the number of covariates J, (14) is a low-dimensional problem, in which case we set ν = 0. When n is smaller than J, (14) is a high-dimensional problem, in which case we choose the optimal ν by minimizing the Bayesian Information Criterion (BIC, Schwarz, 1978), which is given by BIC = 2 ℓ (θ) + c ⋅ log(n), where c is the number of covariates selected in the model. Popular choices of p(β) include the L 1 -penalty (Tibshirani, 1996), the elastic net penalty (Zou and Hastie, 2005), or some folded concave penalty (Fan and Lv, 2011). In this paper, we choose the L 1 -penalty.
We stop iterating the algorithm when the change in the objective function between two consecutive iterations is less than ζ% of the objective function's value at the former iteration, where ζ ∈ (0, 100) is a user-defined stopping threshold, which we choose to be 10. A detailed algorithm is summarized as follows:

Classification Algorithm
We give a complete algorithm for classifying the causes of an event by using the time to event information T i and δ i , baseline covariates X i , covariates collected when the event occurs Z i , and external informative covariates W i in this section.
Firstly, Given X i , T i , and δ i , estimate θ using partial likelihood (5) and Algorithm 1.

Simulation Experiments
To study the improvement in classification using transition likelihoods compared with using baseline information alone, we carry out comprehensive simulation experiments to evaluate the performance of two classifiers based on ξ i (0) and ξ i (1) respectively. We evaluate the performance of the proposed classifiers by comparing their sensitivity, specificity, and overall accuracy in classifying the causes of events. We mimic the data observed in the P. vivax malaria infection study (Lin et al., 2015) and assume that the cause could be either reinfection (ϵ i = 1) or relapse (ϵ i = 2). Sensitivity is defined as the number of subjects correctly classified as relapse divided by the number of relapse subjects; specificity is defined as the number of subjects correctly classified as reinfection divided by the number of reinfection subjects and overall accuracy is defined as the number of correctly classified subjects divided by the number of subjects.
Following the proposed model in Section 2, we assume that the baseline hazard is a homogeneous Poisson process with hazard function λ 0 (t), which is a constant for t > 0 and the same for all subjects. Using the partial likelihood function (5), we do not need to specify λ 0 (t) and expect the classification performance to be similar under different baseline hazard functions. We carry out simulations with three different baseline hazard functions λ 0 (t) = exp(τ), where τ = −0.5, 0, 0.5.
The reinfection process was assumed to be the same for all subjects with hazard function λ i1 (t) = λ 0 (t) exp(α). The relapse process was assumed to have a proportional hazard function λ i2 (t) = λ 0 (t) exp(β′X i ) for subject i. The first classifier classifies a recurrent infection as a relapse if ξ i2 (0) > 0.5, and the second classifier classifies a recurrent infection as a relapse if ξ i2 (1) > 0.5.
We consider two situations where X i and Z i are binary and normally distributed variables. We allow dimensions of X i and Z i to be either low or high. Under the low-dimensional settings, we set two combinations for n and J, with (n, J) = (400, 10) and (n, J) = (800, 20). For the high-dimensional settings, we focus on the classification performance of the classifiers, as well as the variable selection performance. We consider (n, J) = (100, 200) and (n, J) = (200, 400), where the former is closer to the real P. vivax malaria infection study.
When evaluating the variable selection performance, we focus on the sensitivity, specificity, and overall accuracy of selecting covariates with non-zero regression coefficients.
Remark that the improvement of the second classifier is mainly attributed to including the transition likelihoods from the baseline covariates X i to the covariates at recurrence infection Z i . If Z i associates with X i , the transition likelihood is informative, and the second classifier would have a better classification performance. However, when Z i is not associated with X i , then little information would be contained in the transition likelihood. Thus, the second classifier would have a similar performance to the first classifier. We consider two scenarios where the association between Z i and X i is either strong or weak. For simplicity, we assume that for each pair of X ij and Z ij , only one external covariate W ij is associated with the transition.

Binary Covariates
For the low-dimensional setting, we set α to be 0, the first 3 components of β to be log(1.5), and the rest of the components to be 0. We generated X i from the Bernoulli distribution with probability P(X ij = 1) = 0.5 exp{−0.1(j − 1)} for j = 1, …, 10. Such a choice of X i and β indicates that the three most prevalent variants are associated with the relapse. We generated failure time T i * based on the all-cause hazard function λ i (t) = λ i1 (t) + λ i2 (t) and then determined whether the infection is a relapse or reinfection by a Bernoulli random variable with success probability equals to exp(β′X i )/{exp(α)+exp(β′X i )}. The right censoring time C i was generated following a uniform distribution between 0 and c, where c is a constant controlling for 20% censoring. The observed time T i is the minimum between T i * and C i . We assume that for any j ⩽ J, there is one external covariate W ij affecting the transition from X ij to Z ij . For each i and j, we independently generate W ij from a uniform distribution between 0 and 1, which is also independent of X ij .
If the event is reinfection, Z i was generated independently from the same distribution as X i . If the event is a relapse, we generated Z i following the transition model (10). We let q j21 = q j21 * = 0.9 in the first scenario when Z i strongly associates with X i , and q j21 = q j21 * = 0.001 in the second scenario when Z i weakly associates with X i . The intercept q j20 was set to be 0.3 for both scenarios. We repeat the simulation 500 times for each combination of n and J under both scenarios. The operating characteristics of the two classifiers are reported in Table 1. Reported values are means and standard deviations over 500 simulations. Table 1 shows that performance of the first classifier I ξ i (0) > 0.5 is similar under both scenarios in terms of sensitivity, specificity, and overall accuracy. This result is reasonable since we only included baseline covariates and time to event information when constructing the first classifier. This information was generated using the same mechanisms under both scenarios. The second classifier I ξ i2 (1) > 0.5 has a better performance than the first classifier I ξ i2 (0) > 0.5 in scenario 1, where sensitivity, specificity, and overall accuracy are all in favor of the second classifier. The classification accuracy gets better when the sample size is larger. In scenario 1, the strong association between Z i and X i makes the transition likelihood much more informative. Therefore, the improvement in the classification performance is obvious in this scenario. However, in scenario 2, the association between Z i and X i is relatively weak. The transition likelihood contains less information in this scenario. Hence, the second classifier improves little upon the first classifier, averaging merely 12%-18% improvement in the overall accuracy, even when the sample size is larger.
When n and J are fixed, we can see that differences in the baseline hazard function λ 0 (t) barely affect the performance of both classifiers. This result is reasonable since the baseline hazard λ 0 (t) is canceled in (13). As long as the proportional hazards assumption stands, the classification accuracy is similar regardless of the true form of the baseline hazard λ 0 (t).
For high-dimensional settings, we set α to be 0, the first 10 components of regression coefficients in β to be log(1.5), and the rest to be 0. The remaining set-up was the same as in the low-dimensional setting. We repeat the simulation 500 times for each combination of (n, J) under two scenarios. The performance of the two classifiers is reported in Table 2.
In Table 2, we can see similar results as in Table 1. The first classifier behaves similarly under both scenarios. In scenario 1, the second classifier has perfect sensitivity and nearly perfect specificity. In scenario 2, the second classifier has similar overall accuracy as the first classifier, with slightly lower sensitivity and slightly higher specificity. The choice of the baseline hazard function λ 0 (t) barely affects the performance.
We also evaluated the accuracy of coefficient estimates θ = α, β ′ ′ for the high-dimensional settings, where the bias of α and variable selection performance of β are reported in Table   2. Since we did not use transition likelihoods when estimating θ, the accuracy of θ is similar under both scenarios. The baseline hazard function was canceled when calculating the partial likelihood function (5). Therefore, it has little influence on the performance of θ One can see that as J gets larger, the bias of α increases. However, the performance of β improves since more variables are selected correctly.
In addition, we compare our classifiers with the classifiers proposed by Lin et al. (2020). We note that Lin et al. (2020) also proposed two classifiers. The first one only uses baseline covariates and the second one uses both baseline and recurrence covariates. However, they do not use time-to-event information. These classifiers require prior knowledge of the reinfection rate, which significantly affects the classification accuracy. The simulation results are provided in Section S1.1 of the Supplementary Materials.

Normally Distributed Covariates
In addition, we simulate for normally distributed X i and Z i . For both low-and highdimensional settings, we consider the same set-up for α and β as in the simulation study for binary covariates. We generated X i and W i independently from a standard normal distribution. The event time T i *, censoring time C i , and observed time T i were all generated with the same strategy as for the binary covariates. We generated Z i based on the event type, following the transition model (11). We let q j21 = q j21 * = 0.9 in scenario 1, where Z i strongly associates with X i , and let q j21 = q j21 * = 0.001 in scenario 2, where Z i weakly associates with X i . We let q j20 = 0.3 and ψ jk = 1 for each j under both scenarios. We repeated the simulation 500 times for each combination of n and J under both scenarios. The performance of two classifiers is reported in Tables 3 and 4 for low-and high-dimensional settings, respectively. We also reported the estimation accuracy and variable selection performance of θ in the high-dimensional settings in Table 4.
In Table 3, the first classifier performs similarly under both scenarios. The second classifier has better performance than the first classifier under scenario 1 but comparable performance under scenario 2. Also, the change of the baseline hazard function λ 0 (t) barely affects the performance of both classifiers. A similar pattern is also observed in Table 4 in highdimensional settings. As for θ, it has similar accuracy with various baseline hazard functions λ 0 (t). However, when J gets larger, the bias of α increases a little, but the performance of β gets better. In summary, our classifiers perform similarly for both binary and normally distributed covariates.

Misspecified Hazard Functions
To evaluate how our classifiers perform when the hazard models in (3) and (4) are misspecified, we choose the cause-specific hazard functions as λ i1 (t) = λ 0 (t) + exp(α) and λ i2 (t) = λ 0 (t) + exp(β′X i ), where λ 0 (t) = exp(τ), where τ = −0.5, 0, 0.5. In this way, the hazards are no longer proportional. We still consider both binary and normally distributed covariates and set all other parameters the transition functions the same as above.
We repeat the same simulation studies for these additive hazard models. The simulation results are shown in Section S1.2 of the Supplementary Materials. We find that the first classifier does not perform well due to the misspecification of the hazard model. However, after incorporating the transition likelihood, the second classifier can still improve the classification accuracy.
6 Plasmodium vivax Malaria Infection Study

Identify the Cause of Recurrence Infections
As discussed in the introduction, it is essential to identify the cause of infection in P. vivax malaria research when the primary interest is treatment efficacy or effectiveness. In this section, we apply our proposed classifier to the P. vivax malaria data described in Section 1.1. We aim to classify the recurrent infection as either reinfection (ϵ i = 1) or relapse (ϵ i = 2). We first fit the cause-specific hazards model (3) and (4) with X i as a vector of binary covariates that indicate whether a haplotype (genetic variant) is present or absent. Parameters θ = (α, β′)′ were estimated via the penalized partial likelihood function (5)  After we obtained θ, probabilities ξ i1 (0) and ξ i2 (0) were calculated based on formulae (6) and (7), respectively. For subjects with a recurrent infection, reading frequency for each haplotype presented at the baseline sequencing of the initial infection is used as the external covariate W i . Here, covariates X i and Z i are binary variables. When the recurrent infection is reinfection (ϵ i = 1), we assume Z i is independent of X i and W i , but follows the same distribution as X i . In this case, ϕ i (1) can be estimated independently without using the pseudo-likelihood function (8), and the distribution of Z i can be estimated using X i alone.
To be specific, for ϵ i = 1, the transition likelihood function ϕ i (1, γ 1 ) can be written as For ϵ i = 2, when the recurrent infection is a relapse, we assume the transition likelihood follows the form of (10), that logit μ ij2 = q j20 + x ij q j21 + w ij x ij q j21 * , with w ij being the reading frequency of the jth haplotype of subject i when the haplotype is presented at the baseline sequencing, i.e., x ij = 1.
We replaced ϕ i (1, γ 1 ) in (8) by ϕ i 1, γ 1 and maximized the pseudo-likelihood function to obtain γ 2 . When using ν = 2.05, we have q 0 = − 1.366, q 1 = 2.738, and q* = 4.317. When the recurrent infection is relapse, the parameter q 0 is the log odds of a subject whose baseline sequencing did not contain haplotype j (x ij = 0) but the follow-up sequencing at the recurrence did (z ij = 1). The estimate q 0 = − 1.366 can be transformed into an estimated transition probability of 0.203, meaning there is 20% chance that the unseen haplotype at the baseline may show up at the recurrence when the cause is relapse. Since q 0 + q 1 = 1.372, it shows that there is around 80% chance of observing a haplotype again at the recurrence (z ij = 1) when the cause is relapse and the haplotype appeared at the baseline (x ij = 1). Since q* = 4.317, it indicates that there is more than 99% chance of observing the same haplotype again at the recurrence (z ij = 1) when the reading frequency of the haplotype is more than 80% at the baseline (w ij = 0.8). When using ν = 0.8, we have q 0 = − 1.323, q 1 = 2.506 and q* = 4.284. These estimates are similar to those when using ν = 2.05 and can be interpreted analogously.
Finally, we calculate ξ ik (1) by (9) for k = 1, 2 and classify the recurrent event as relapse if (1) and reinfection otherwise. Table 5 contains the classification results for the 23 subjects with recurrent infection based on our proposed method using ν = 2.05. The tables include days to recurrence, baseline and recurrence haplotypes, the estimates β, recurrence haplotype prevalence, two classification probabilities, and classification results from Lin et al. (2020), which analyzed the same data without utilizing the time to event information and external covariates in the estimation of transition likelihoods.
Our proposed method classifies 3 out of 23 recurrence pairs differently from Lin et al. (2020). The first pair is 87 → 87R, which was classified as relapse by Lin et al. (2020) but as reinfection by our classifier. Five variants showed up at the baseline sequencing, of which only CAM.00, the haplotype with the highest prevalence, showed up again in the recurrence sequencing. Also, the days to recurrence for this pair are 81 days, which is a relatively long time for relapse, suggesting that this recurrence event is more likely to be reinfection. The second pair is 123 → 123R, which was classified as reinfection by Lin et al. (2020) but as relapse by our classifier. Two haplotypes (CAM.00 and CAM.02) were observed at the baseline sequencing, and haplotype CAM.00 showed up again at the recurrence sequencing with CAM.01. Since only two haplotypes appeared at the recurrence, and CAM.00 is the most prevalent variant, the recurrent infection looks more likely to be a reinfection if not taking time to recurrent into consideration. However, the recurrent infection occurred only 26 days after the initial infection, which is a relatively short time compared to other reinfection cases. The only case classified as reinfection with a recurrent time less than 26 days was pair 160 → 160 R, with only 17 days to recurrence, but this is reasonable since there is no overlap between the baseline and recurrence variants. Notably, the pair 123 → 123R has 96% CAM.00 in the reading frequency at baseline, which supports the classification as relapse due to a high likelihood of observing the same variant in relapse if the variant has a high reading frequency at baseline, as suggested by large q*. The last disparity comes from pair 153 → 153R, which was classified as relapse by Lin et al. (2020) but as reinfection by our classifier. There is no overlap between initial and recurrence variants. The time to recurrence is 115 days, which is longer than any case that was classified as relapse. The only case with days to recurrence longer than this pair is pair 151 → 151R, which was classified as reinfection by both Lin et al. (2020) and our classifier. Therefore, it is more reasonable to classify pair 153 → 153R as reinfection.
Overall, by considering the time to event and baseline haplotype reading frequency, our classifier achieves more consensus in this study.

Model Diagnosis and Sensitivity Analysis
In this specific study, we assume that the cause-specific hazard functions are proportional to a baseline hazard function. Next, We verify such an assumption for the P. vivax malaria data using the martingale residuals method proposed in Section 2. We carry out the model diagnosis as follows. For a sequence of x in the range of the linear predictor β ′ X i , we calculate the test statistic T (x) = ∑ i = 1 n I β ′ X i ⩽ x M i , where M i is the martingale residual defined in Section 2. Using a Monte-Carlo simulation with Q i (i = 1, …, n) sampled independently from the standard normal distribution, the confidence band for T(x) can be constructed by calculating T Q (x) = ∑ i = 1 n I β ′ X i ⩽ x M i Q i . We simulate the process of T(x) by repeating the sampling. Using ν = 2.05, the linear predictor β ′ X i ranges from 0 to 1.94. Misidentification of unique haplotypes is a concern in the current analysis. Low-frequency minority genetic variants that only differ in sequence by one nucleotide base pair to common variants may represent false haplotypes generated by sequencing error. We adjusted the stringency of criteria used for calling haplotypes to "collapse" such variants together, reducing the total number of 67 unique haplotypes to 32 (Hathaway et al., 2018). As a sensitivity analysis, we also analyzed the data with this total number of 32 haplotypes, based on collapsing variants with 1-nucleotide apart within the same isolate. The classification result has several disparities with the one using 67 haplotypes but mostly agreed with the one based on the method in Lin et al. (2020) using 32 haplotypes. It is not surprising to find the classification result sensitive to the identification of haplotype since our method relies on the modeling of the transition between variants. The collapse of variants and corresponding classification results using 32 haplotypes are provided in the Supplementary Materials.

Discussion
We proposed a classification method for identifying the latent cause of events under competing risks set-up, which utilizes both time to event and transition likelihood information for better classification performance. By considering the transition likelihood, we utilize more information when constructing the classifier, which leads to better performance than the classifier using only baseline information. The method can be applied regardless of the true form of the baseline hazard function, and can also be applied to a variety of covariate data types. We examined the performance of our method through simulation studies under various settings as well as real data analysis, which shows high reliability of our method.
When modeling the outcomes of competing risks, we assumed a proportional hazards model with a common baseline hazard function for every cause-specific hazard. When the hazards share the same covariates, the model may not be identifiable. To avoid the identifiability issue when analyzing the P. vivax malaria data, we assume the reinfection process is independent of any baseline covariates in X i but has a hazard function proportional to a baseline hazard λ 0 (t). This assumption is reasonable for our data but may not be ideal for a general case. Alternative approaches for the estimation of the hazard functions call for more investigation. In our current approach, we assume the transition of covariates is independent of time. It will be of interest to generalize the transition model to be a function of time. A possible approach is to include time t i as a covariate in the model for μ ijk . This approach is somehow restricted to a linear function of time, which is subject to model misspecification.
The statistical inference of regression coefficients β is also a topic worth investigating. While the current method can perform variable selection on β with high accuracy, inferring the significance of these selected variables needs more work. We also remark that if one would like to evaluate the treatment efficacy using our approach, they can include treatment as a covariate in (4). Then, using the same penalized partial likelihood method as shown in (14), they can estimate the coefficient corresponding to the treatment for the treatment efficacy.
Finally, we point out that due to the nature of the P. vivax malaria, causes for recurrence are often unobservable. This problem motivates us to develop the classification method for totally unobservable causes. For other applications, when causes may be partially observed, one can plug their cause-specific hazards into the partial likelihood function (1) for subjects with observed causes and treat the rest causes as missing data. Then, EM algorithms may be utilized to obtain θ, based on which one can still build the two proposed classifiers. It will be of great interest to study the estimator's efficiency improvement by the transition likelihood in the future study. Goodness-of-fit model diagnosis for the P. vivax malaria data using ν = 2.05.   Classification and variable selection performance of proposed classifiers with high-dimensional binary covariates.    Classification performance of proposed classifiers with high-dimensional continuous covariates.