Developing Multivariate Survival Trees with a Proportional Hazards Structure

In this paper, a tree-structured method is proposed to extend Classification and Regression Trees (CART) algorithm to multivariate survival data, assuming a proportional hazard structure in the whole tree. The method works on the marginal survivor distributions and uses a sandwich estimator of variance to account for the association between survival times. The Wald-test statistics is defined as the splitting rule and the survival trees are developed by maximizing between-node separation. The proposed method intends to classify patients into subgroups with distinctively different prognosis. However, unlike the conventional tree-growing algorithms which work on a subset of data at every partition, the proposed method deals with the whole data set and searches the global optimal split at each partition. The method is applied to a prostate cancer data and its performance is also evaluated by several simulation studies.


Introduction
In recent years, many efforts have been made in the application of treestructured methods to the censored data.The thrust of tree-structured methods comes from the desire to extract meaningful subgroups.Although proportional hazard model (Cox 1972) is a very useful tool in quantifying the effects of covariates on survival time, it is usually problematic for prognostic classification.For example, we may create prognostic groups based on "risk score" which is calculated from linear combinations of coefficients and covariate values.However, this approach may be problematic: (a) the "risk score" depends on the correct specification of the model structure, which may be hard to check when many covariates are involved; (b) the regression model becomes difficult to interpret for the purpose of classification when there exist many interaction terms; (c) the resulting risk groups are arbitrary, and (d) sometime a risk group becomes empty because no patients are included in some covariate profiles.
In contrast, tree-structured methods recursively split the covariate space and directly lead to homogeneous subgroups.Tree-structured method was first proposed by Morgon and Sonquist (1963) and became popular due to the monograph Classification and Regression Tree (CART) by Breiman et al (1984).Since Gordon and Olshen (1985) gave the first adaptation of CART algorithm to censored data, a rich amount of literature has been dedicated to survival data.Davis and Anderson (1989) grew trees by assuming survival times to be exponential within a given node.LeBlanc and Crowley (1992) proposed a method to grow trees via an one-step full log-likelihood estimation.Huang, Chen and Soong (1998) extended Davis and Anderson's method to the time-dependent case, assuming the survival times to be piecewise exponential.The above methods grew trees rewarding within-node homogeneity.Segal (1988) presented a method to grow trees by goodness of split.Instead of maximizing within-node homogeneity, Segal's method maximized between-node separation, using log-rank test statistics as the splitting rule.Bacchetti and Segal (1995) extended Segal s method to survival data with left-truncation and time-dependent covariates.LeBlanc and Crowley (1993) developed an efficient pruning algorithm for trees that maximized between-node separation.
All these methods were proposed for univariate survival data which assumes survival times to be independent.However, this assumption may be violated in multivariate (clustered) survival data.In the ophthalmology studies, for example, it may be inappropriate to assume that the survival patterns of the two eyes from the same subject are independent.There are very few published tree methods regarding correlated survival data.Gao, Manatunga and Chen (2004) and Su and Fan (2004) extended CART algorithm to multivariate censored data in a similar manner.Both of them introduced a gamma distributed frailty to account for the dependence among survival times.The former used Wald-test statistics as the splitting function while the later was based on a likelihood ratio test.Though survival trees developed above were computationally convenient, they suffered a potential drawback that the overall model structure was unclear.When a partition was done, for example, the above methods assume hazards within the two daughter nodes to be conditionally proportional given the unobserved frailty, but the hazards structures for any two nodes from different parents are unclear.In this paper, we propose an algorithm that develops multiple survival trees assuming a proportional hazards structure within the whole tree.The proposed method can be viewed as a hybrid of CART algorithm and WLW model (Wei, Lin and Weissfeld 1989).Instead of fitting a model for formal inference, it automatically develops a set of prognostics groups while assuming a proportional hazards structure.This paper is organized as follows.In section 2, we describe the algorithm for classification.In section 3, we apply the method to a prostate cancer data.We assess the performance of the method by simulations in section 4. Finally, we conclude with a discussion in section 5.

Methodology
The method is proposed for multivariate (clustered) survival data where several items (or units of system) are followed simultaneously and we only consider the data with a fixed cluster size.The simplest clustered data to consider is the pair of survival times observed in two eyes in an ophthalmologic study.Suppose that data includes N clusters (subjects) and each cluster includes J items (eyes).Let i = 1, 2, • • • , N index clusters and j = 1, 2, • • • , J index the items within each cluster.Let T ij = min(X ij , C ij ) be the observed time for the j th item in the i th cluster, where X ij is the true failure time and C ij is the true censoring time.Let δ ij be the corresponding survival indicator with δ ij = 1 if X ij ≤ C ij and δ ij = 0 otherwise.In this paper we assume that the covariate vector is common to items within the same cluster and let Z i denote the covariate vector for i th cluster.Then, the data can be denoted as We further assume that the survival time and censoring time are independent.Given covariate Z, the survival times between clusters are independent and the survival times within the same cluster may be dependent.
The proposed method intends to extract meaningful subgroups via a tree-like algorithm.Similar to the procedure in Gao et al (2004), trees are developed to maximize between-node separation as measured by a Wald-test statistic and the partition is performed on a single covariate at a time.A large tree is developed first and Segal's pruning method (1988) is adopted to determine the proper tree size.Finally a marginal Kaplan-Meier survival curve is prepared separately (i.e., for each eye) and thus to summarize the performance of the classification.However, unlike other tree-growing algorithms which work on a subset of data at a partition, the proposed method deals with the whole data set at each split and assumes a global proportional hazard structure in whole tree.
The proposed method models on the marginal distributions of survival times, specifying the hazard function for j th event to be, where h 0j (.) is the baseline hazard function for j th event and g( .) is an arbitrary non-negative function of covariate vector.The classification will be performed in a stepwise manner.Let Γ k denote the k subgroups obtained at k th step.We assume that these subgroups are consisted of two disjoint subsets A and B, where that cannot be further split, with k = k 1 + k 2 .In terms of non-splitting subgroup, we mean that either the covariate space within the subgroup becomes homogeneous or the sample size (number of events) of the subgroup is less than a pre-specified value.Suppose that, at (k + 1) th step, a subgroup γ ∈ A is further split, and let Γ k+1 represent the k + 1 subgroups at current step.For mathematical convenience, we rearrange the subgroups as and C k+1 being the reference group and the two newly obtained groups respectively.The reference group will be arbitrarily chosen (say, B 1 ) from subgroups that do not permit further splitting.Since the splitting statistics (see below) is based on the relative difference of the two new subgroups (C k and C k+1 ) rather than their absolute effects, the comparisons among candidate partitions will be valid as long as the two new subgroups have a common reference level.
In order to rate over all possible splits, the following working model is fitted, and the first subgroup is set to be the reference level.The unknown parameters can be estimated by maximizing the following partial likelihood function (Wei, Lin and Weissfeld 1989), where Y lj (t) is an indicator for whether j th item in l th cluster is at risk at time t, with Y lj (t) = 1 if t lj ≥ t and 0 otherwise.The corresponding score function and the information matrix of β respectively are, where

and S
(2) 2) is essentially a WLW model (Wei, Lin and Weissfeld 1989) with a set of indicators for the resultant subgroups.The estimator β will be consistent for β but A −1 (β) may not provide a valid variance-covariance estimator (Lin and Wei, 1989).Then, a "sandwich" estimator (Lin and Wei, 1989) can be used to account for the withincluster association, is the information matrix of β evaluated at β.The splitting function to evaluate and compare any candidate partitions is }, the robust Wald-test statistic for testing the hypothesis whether the two new subgroups are different H 0 : β k = β k+1 .We note that, as in all tree-based methods, the models are developed conditionally in that the latter split is always conditional on previous ones.In such a case, the asymptotic distributions of estimated parameters become unclear and a formal statistical inference may be problematic.However, Φ still quantifies the separation between the two new subgroups and can be used to rate over the candidate splits.The best partition is chosen such that Φ is maximized among all candidate splits in subgroups that permit further partition.In the special case where all terminal nodes permitting further splitting (the set B is empty), first we need to randomly pick up one node from set A (A 1 , say) as the reference group to evaluate the splits among all other subgroups.Then, we fix one of the other subgroups as the reference to evaluate all possible splits within node A 1 .
In summary, the proposed method incorporates a tree-like algorithm into the conventional WLW model.At each step all competing splits are evaluated, and the one maximizing the splitting function is chosen and executed.As is conventional practice in tree-based methods, we choose a conservative stopping criterion to avoid missing important data structures.The partition process is applied repeatedly until either all the subgroups can not be further split or Φ ≤ 0.45, which corresponds to 50% percentile of an X 2 distribution with 1 degree of freedom.As mentioned previously, however, the asymptotic distributions of estimated parameters become unclear due to the conditionality of tree-based methods.The choosing of 0.45 is rather arbitrary and could be adjusted by readers to avoid a tree with too large size.Usually a large tree will be developed using such a conservative stopping criterion, and then we select the tree with a proper size by adopting Segal's pruning algorithm (1988) detailed below.We initially grow a very large tree and identify all possible branches of the tree in a bottom-up approach.Within each branch, we find the maximal splitting statistic that reflects the strength of the link of that branch to the tree.We then rank these maximal statistics and prune off the branch with the smallest value (the branch with the weakest link).Thus, we get the first pruned tree plus its weakest linking statistic.The second tree can be obtained by applying this process to the first pruned tree.Repeating the above steps until the pruned tree contains only the root node, we obtain a sequence of nested trees together with their weakest linking statistics.Finally we plot the weakest linking statistics against the tree sizes.Because a plateau in the curve indicates that the strength of the link does not decrease as the tree size increases, we choose the best tree as the one corresponding to the characteristic "kink" in the curve.

Application: Prostate Cancer Data
Disease-free survival rate is an important measurement in the evaluation of treatment effect for prostate cancer patients.Currently there are two types of definitions for disease freedom in prostate cancer treatment.One specifies a non-detectable level of prostate specific antigen (PSA) and the other, by the American Society of Therapeutic Oncology (ASTRO), defines a non-rising PSA level.From August 1992 through October 1997, 537 patients with prostate cancer were treated by transperineal I-125 implant and followed up for 3 to 8 years, with a median follow-up of 6 years (Critz 2002).The outcome of interest for this study is the time to recurrence.Two outcomes were defined using followup PSA levels.One records the time from treatment to the detection of ≥0.2 ng/ml PSA level.The other is the time to 3 consecutive increasing in PSA as measured 6 months apart.At the end of study, 81 patients recurred based on the definition of undetectable PSA and 59 patients recurred according to ASTRO definition.Three clinical characteristics were also collected: pre-treatment PSA (PrePSA), clinical stage, and Gleason Score.All of them were ordinal covariates with 4 categories (Table 1).Eight patients were excluded due to missing values in Gleason Score and these 529 patients constituted the cohort for our study.The original goal of the study was to check the agreement between the two definitions for recurrence.In this paper, we are interested in identifying risk groups for recurrence.For comparison, we first analyze the data by WLW model (Wei, Lin and Weissfeld 1989), treating the first category of each covariate as the reference level.The results show that those patients being in stage T2B and T2C have higher risk of recurrence, and those patients with higher level of pre-treatment PSA or larger values of Gleason Score are more likely to experience recurrence (Table 2).Though the above model is very useful in quantifying prognostic effects, it does not directly lead to risk groups.Next, we apply the proposed tree-structured method to this data and set the minimal size of the terminal nodes to be 20 failures.The method first splits on whether pre-treatment PSA is larger than 10 or not.Then Gleason Score (> 6 versus ≤ 6) and clinical stage (T1A-T2A versus T2B-T2c) are chosen as the best partitions respectively.After 5 partitions, an original tree with 6 terminal nodes is developed (Figure 1(a)), where circles and squares represent internal nodes and terminal nodes respectively, and Φ denotes the splitting function for each partition.After pruning process, we end up a final tree with 4 terminal nodes which lead to 4 risk groups (Figure 1(b)): PrePSA ≤ 10.0 and Gleason Score ≤ 6 (Group A, node 6), PrePSA ≤ 10.0 and Gleason Score > 6 (Group B, node 7), PrePSA > 10.0 and clinical stage being T1A-T1C or T2B (Group C, node 4), and PrePSA > 10.0 and clinical stage being T2B or T2C (Group D, node 5).Because larger values of covariates associated with higher risk of recurrence, we anticipate that patients in Group A have the lowest recurrent risk and patients in Group D have the highest risk.The marginal Kaplan-Meier plots of these 4 prognostic groups are presented in Figure 2, and the curves show that these subgroups are well separated in each definition for disease-freedom.The risks of recurrence are in the order of Group A < Group B < Group C < Group D.

Simulation Studies
Several simulation studies are performed to assess the proposed method.Data sets are generated from the bivariate Clayton-Oakes model (Oakes 1982), with the joint survivor function S(t 1 , t 2 ) = P r(T 1 > t 1 , T 2 > t 2 ) given by, where G(t 1 ) = P r(T 1 > t 1 ) and H(t 2 ) = P r(T 2 > t 2 ) are marginal survivor functions for survival times T 1 and T 2 respectively, and the parameter θ governs the association between T 1 and T 2 .As θ → 1+, S(t 1 , t 2 ) → G(t 1 )H(t 2 ) and the two survival times become independent.As θ → ∞, S(t 1 , t 2 ) → min(G(t 1 ), H(t 2 )) and the bivariate distribution shows maximal association.In the simulation studies, different values of θ are set to introduce the model a low to moderate association, indexed by Kendal's τ =0.2, 0.4, and 0.6 respectively.
In the simulations, for simplicity we consider G(.) = H(.) and assume the hazard function to be where the baseline hazard h 0 (t) is chosen such that the baseline survivor function has a Weibull distribution S 0 (t) = e −(ηt) κ with a shape parameter κ = 1.5 and a scale parameter η = 1.The function g(z) is defined as g(z) = e β 1 I(z 1 >2)+β 2 I(z 2 >1) , with β 1 = 1.0, β 2 = 0.75 and I(.) to be the indicator function of covariates.Therefore, the above model leads to 4 prognostic groups.To have a better evaluation of the ability to correctly identify these prognostic groups, two extraneous variables (z 3 and z 4 ), which are not related to the outcome, are also included into the data.Among these 4 covariates, z 1 and z 4 have discrete uniform distributions taking values from 1 to 5, and z 2 and z 3 are dichotomous variables which take values 1 and 2 with the same probability.In addition to the complete data, censoring times are also introduced independently from a uniform distribution [0,u].Approximately 30% censoring is introduced by adjusting the value of u.Each simulation has 1000 repetitions and each repetition consists of 400 pairs of survival times.The simulations are implemented via statistical package R (version 1.5).In simulation studies, the minimal size for a terminal node is set to be 40 failures.The final trees are categorized according to their ability to identify correct data structures.A tree will be classified as "correct" if all 4 prognostic groups are identified as the terminal nodes of the pruned tree.A tree will be classified as "over sized" if all 4 prognostic groups are included in the tree, but not all of them are terminal nodes of the tree.A tree will be classified as "partial recognized" if only 2 of the 4 prognostic groups are included in the tree.Otherwise the tree will be classified as "failed".Table 3 summaries the simulations, where τ reflects the correlation between survival times and a larger value of τ is associated with a higher correlation.The results show that the proposed method performs quite well in identifying correct data structures.In all the simulations, the prognostic groups in nearly 90% of the simulated data are correctly identified.However, the capability of identification decreases slightly in the presence of censoring.Because the power to detect the difference decreases as the correlation increases (Lee, Wei and Amato 1992), the capability of identification also decreases as the correlation increases.

Discussion
Tree-structured methods have provided a superior way for prognostic classification.Under classical CART algorithm, trees are grown by maximizing withinnode homogeneity which is usually measured with log-likelihood.In multivariate survival data, writing a log-likelihood function is problematic due to the unknown marginal survivor functions as well as the presence of association among events.In contrast, the idea of growing trees rewarding between-node separation by Segal (1988) can be extended to develop survival trees in multivariate survival data.In the two survival trees by Gao et al (2004) and Su and Fan (2004), for example, a gamma distributed frailty was introduced to account for the dependence among survival times and the trees were developed maximizing relative risks as indexed by Wald test and likelihood ratio test respectively.In trees constructed by conventional methods, all the resultant subgroups are completely unrelated.However, this may not be true for the above multivariate survival tree because the two daughter nodes share a common baseline hazard function.That is, two components (relative risk and baseline hazards) are involved in the model, but only one (relative risk β) is used for tree construction and the other (baseline hazards) is left unspecified.To address this problem, we advocate a method to develop classification trees under proportional hazards structure.The proposed method works on marginal distributions of the survival times, and uses the "sandwich" estimator of variance to account for within-cluster association.The splitting function is based on the robust Wald-test statistics of WLW model (Wei, Lin and Weissfeld 1989) which is fast in computation.The proposed method can be used as a complimentary tool to the usual parametric or semi-parametric models.The method is useful in preliminary data analysis to identify the potential data structures, thus facilitating a better understanding of the survival data and a more efficient statistical modeling.We evaluate the method by several simula-tion studies.The results show that the method performs quite well in prognostic classification.In all the simulations, data structures in nearly 90% data sets can be correctly identified.
The proposed algorithm for multivariate survival trees shares some inherent features of the univariate survival tree methods.Therefore, it is straightforward to further extend this algorithm in a variety ways, i.e., incorporating surrogate splits (Breiman et al, 1984) to handle missing values in covariates or using boosting and bagging method (Hothorn et al 2004) to stabilize its performance.Comparing to the conventional WLW model (Wei, Lin and Weissfeld 1989) or the trees developed via frailty model (Gao et al 2004, Su andFan 2004), the proposed method possesses some unique advantages.First, instead of presenting an algebraic formula, this method provides a set of subgroups directly.In addition, the prognostic groups are developed within the framework of proportional hazards model.Instead of assuming a locally proportional hazards structure, this algorithm assumes a proportional hazards structure within the whole tree and thus presents a clear model structure.As a consequence, a global optimal split is obtained at each partition because the best split is searched over the whole tree.Though the above advantages are gained in the price of intensive searching, computational time has become a minor issue with the availability of high performed personal computers nowadays.

Figure 1 :
Figure 1: Multivariate survival trees for prostate cancer data, where Φ is the splitting statistics quantifying the separation between two subgroups and N represents the number of patients in the resultant prognostic groups.

Figure 2 :
Figure 2: Kaplan-Meier plots for the 4 prognostic groups derived from the survival tree (A separate plot was prepared for each definition of disease freedom)

Table 1 :
Clinical characteristics for the patients with prostate cancer

Table 2 :
Results of marginal model analysis for prostate cancer data

Table 3 :
Simulations investigating the capability in identifying data structures *: Kendall's τ , which index the within-cluster association