Nonparametric Multiple Imputation of Left Censored Event Times in Analysis of Follow-up Data

In this paper, we consider analysis of follow-up data where each event time is either right censored, observed, left censored or left truncated. In the case of left censoring, the covariates measured at baseline are considered as missing. The work is motivated by data from the MORGAM Project, which explores the association between cardiovascular diseases and their classic and genetic risk factors. We propose a nonparametric multiple imputation (NPMI) approach where the left censored event times and the missing covariates are imputed in hot deck manner. The left truncation due to deaths prior to baseline is compensated by Lexis diagram imputation introduced in the paper. After imputation, the standard estimation methods for right censored survival data can be directly applied. The performance of the proposed imputation approach is studied with simulated and real world data. The results suggest that the NPMI is a flexible and reliable approach to the analysis of left and right censored data.


Introduction
We consider data from a follow-up study where a group of subjects (hereafter, a cohort) has been followed up for a fixed calendar period for fatal and non-fatal cardiovascular events starting from the baseline examination. Our objective is to model the effect of some covariates on the risk of coronary heart disease (CHD). The time of the first event is recorded for each subject using the age of the subject as the time scale. If the first event is non-fatal, the follow-up for death continues after the event. If no events have occurred by the end of the follow-up, the event time is considered as right censored. If an event has occurred before the baseline examination, the event time is considered as left censored. Thus, there are three possibilities for each subject in the cohort (observed data):
1. there are no events either during the follow-up or before the follow-up (right censoring).
2. an event occurs during the follow-up and there have been no events before the beginning of the follow-up (event time is observed).
3. there has been an event before the beginning of the follow-up but we do not know the time when the event took place (left censoring).
Left censoring arises because it is recorded at baseline whether there have been CHD events in the past but data on the event times are not available. The event time $X$ is observed only if it is smaller than or equal to the censoring time $C$ and greater than $B$, the age of the subject at baseline. Consequently, the observed time $T$ is obtained as a function of $X$, $C$ and $B$:
$$T = \min(\max(X, B), C).$$
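The relation between the event time, the age at baseline and the censoring time can be illustrated with a few lines of Python. This is only a minimal sketch: the function names `observed_time` and `censoring_status` and the numerical values are illustrative assumptions, not part of the original analysis.

```python
def observed_time(x, b, c):
    """Observed time T = min(max(X, B), C)."""
    return min(max(x, b), c)


def censoring_status(x, b, c):
    """Classify an event time X given age at baseline B and censoring time C."""
    if x <= b:
        return "left censored"   # event occurred before the baseline examination
    if x <= c:
        return "observed"        # event occurred during the follow-up
    return "right censored"      # no event by the end of the follow-up


# Hypothetical subject: baseline at age 50, follow-up ends at age 60.
for x in (48.0, 55.0, 70.0):
    print(x, observed_time(x, 50.0, 60.0), censoring_status(x, 50.0, 60.0))
```

For a left censored subject the observed time equals the age at baseline, and for a right censored subject it equals the censoring time.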
The observed data can be divided into four sets: right censored observations $R_0$, non-fatal events during the follow-up $R_1$, fatal events during the follow-up $R_2$ and left censored observations of non-fatal events $R_3$. Let the numbers of observations in these sets be $n_0$, $n_1$, $n_2$ and $n_3$, respectively.
The analysis of left censored observations requires that we change the follow-up to start from an age prior to the baseline examination. This leads to the problem that, by definition, the cohort cannot have members with a fatal event before the baseline examination, and therefore the cohort followed up e.g. from age 25 is not comparable with the cohort followed up after the baseline examination. In other words, we are dealing with left truncation. Fortunately, it turns out that we can use the observed data to compensate for the potential deaths before the baseline examination. The left truncated observations can be divided into three groups. Set $R_4$ contains subjects who had a fatal CHD event before the baseline examination as their first event. Set $R_5$ contains subjects who first had a non-fatal CHD event and then later died before the baseline examination. Set $R_6$ contains subjects who died before the baseline examination without any preceding CHD events. Both the sets of left truncated subjects $R_4$, $R_5$ and $R_6$ and their numbers $n_4$, $n_5$ and $n_6$, respectively, are completely unobserved.
The covariates measured at the baseline examination may be divided into two categories: sex and genes are examples of permanent covariates that do not change as a function of time. Cholesterol level, blood pressure, smoking and body mass index (BMI) are covariates that do change in time although they are often treated as constant in cohort studies. In particular, they are likely to change substantially after a CHD event due to intervention to prevent recurrent events. Therefore the values of the covariates measured after the event are influenced by the event and cannot be considered as risk factors for this event. Consequently, if the event time is left censored, the time-varying covariates must be taken as missing. We use $G$ to refer to permanent covariates and $Z$ to refer to time-varying covariates.
Under the assumption of non-informative censoring, the log-likelihood function related to left truncated and left and right censored data may be expressed in a general form, where $z_i$ and $g_i$ refer to observed covariates, $Z_i$ and $G_i$ refer to unobserved covariates, and $F(t)$ and $f(t)$ are the cumulative distribution function (cdf) and the probability density function (pdf) of the event time, respectively. The different subject groups are summarized in Table 1. Our primary interest in this paper is to estimate the effect of the permanent covariates on the event times without bias and as accurately as possible. It is assumed in this paper that the reader is familiar with the standard methods for the analysis of right censored survival data.
The described setup is motivated by data from the MORGAM Project (Evans et al., 2005). MORGAM is a large international project on cardiovascular epidemiology that pools follow-up data from several cohorts. Currently 22 centers (mostly from Europe) are involved and the pooled database contains more than 140 000 subjects. The objective of the MORGAM Project is to explore the association between cardiovascular diseases (CVD) and their classic and genetic risk factors. Population cohorts, examined at study baseline, are followed up for fatal and non-fatal CVD events. The first occurrence of CHD is one of the main endpoints of the study. The MORGAM cohorts also contain subjects who had their first non-fatal CHD event before the baseline examination. For these subjects the exact event times are unknown. Although in some cases it would be possible to find the exact event times, e.g. from hospital records, the cost of the additional data collection would be rather high. The percentage of subjects with baseline CHD in MORGAM cohorts varies from 0.5 % to 13 %, which is a considerable proportion when compared to the percentage of first incidence of CHD during the follow-up, which varies from 0.5 % to 17 %. Hence, the inclusion of the baseline CHD cases in the time-to-event analysis would significantly increase the number of events and provide more information on the relatively young subjects. The use of the baseline CHD cases is well suited to the analysis of genetic risk factors because genotypes, contrary to many other risk factors, cannot be affected by a preceding CHD event. An illustration of a typical study design is presented in Figure 1.
In this paper, a nonparametric multiple imputation (NPMI) approach is proposed to handle the left censored and left truncated event times. In the NPMI approach, each left censored observation is replaced by several imputations drawn from the empirical distribution of observed non-fatal event times. Missing covariates are imputed together with the event time. The use of the Bayesian bootstrap weights guarantees sufficient variation between imputations. The left truncated observations are imputed as well using a novel approach that we call Lexis diagram imputation. After imputation, the standard estimation methods for right censored survival data can be directly applied. The NPMI approach follows the general idea of multiple imputation introduced by Rubin (1987) but the essential difference is that the left censored observations are partially observed. The proposed approach can be characterized as multiple hot-deck imputation (Levy, 1998) where the set of donors is conditional on the left censored event time.
Several authors have used multiple imputation in survival analysis and public health studies. Multiple imputation of interval censored data is studied in (Pan, 2000; Glynn and Rosner, 2004; Geskus, 2001; Pan, 2001). Additional examples of the use of multiple imputation can be found in (Zhou et al., 2001; Taylor et al., 2002; Mishra and Dobson, 2004). The main difference between these works and the approach proposed in this paper is that the NPMI does not use a parametric model for the imputation. We also impute observations that are completely unobserved.
Instead of imputation, we could, at least in principle, construct a parametric model for left and right censored data and estimate the model parameters using a Bayesian approach or the EM algorithm. Nevertheless, it is not self-evident how left truncation should be taken into account in these models. If we ignore the left truncation, the described setup can be seen as a special case of interval censored data where the observed intervals have the form $[0, t]$ (left censored), $[t, t]$ (observed event) or $[t, \infty)$ (right censored). Examples of the analysis of interval censored data can be found e.g. in (Kim et al., 1993; Zhao et al., 2005; Komarek et al., 2005). Alioum and Commenges (1996) proposed a method for estimation of the proportional hazards model under censoring and truncation. The method is an extension of Turnbull's nonparametric maximum likelihood estimator (Turnbull, 1974) and may suffer from identifiability problems. Our motivation to propose an imputation based solution is fourfold: First, the NPMI provides a straightforward way to deal with left censoring, left truncation and covariates not missing at random. Second, in exploratory analysis of a great number of potential covariates, the computational speed of the NPMI is an important practical benefit. Third, the use of the NPMI is not restricted to the proportional hazards model. Fourth, the NPMI works as a benchmark for more complicated models.
The paper is organized as follows. The NPMI approach is introduced in Section 2. Simulation studies comparing imputed event times and covariates with their true values are presented in Section 3. Regression estimates from the NPMI approach and from the analysis of baseline healthy subjects are compared as well. A real world example using MORGAM data is presented in Section 4. Section 5 concludes the paper.

Overview
In this section, we introduce a nonparametric multiple imputation (NPMI) method for the analysis of left and right censored data. Each imputation round contains generation of Bayesian bootstrap weights, imputation of left censored event times and Lexis diagram imputation of left truncated observations. The left censored event times are imputed by several values drawn from their empirical distributions. The imputation is carried out in a hot deck manner, selecting the donors conditionally on age at baseline examination and estimated lifetime. The missing covariates are imputed simultaneously by the covariates of the chosen donor. In Lexis diagram imputation of left truncated observations, we first generate the number of missing subjects from the Poisson distribution and then draw a random sample from all observed deaths. The sampling weights are proportional to the unobserved time in the Lexis diagram divided by the follow-up time.

NPMI with non-fatal events only
First we consider a simplified situation where all subjects are followed up at least to the age $b_{\max} = \max_i b_i$ and all events are non-fatal. We observe the age at baseline examination $b \le b_{\max}$ and want to impute the left censored event time $X$. To do this we consider all subjects who had their first event during the follow-up and before age $b$. Let $Q = \{i : i \in R_1, x_i < b\}$ be the set of these subjects. If we wish our imputation scheme to be proper, each imputation round must be started with the Bayesian bootstrap (Rubin, 1987). The Bayesian bootstrap assigns a random weight $w_i$ to each observation in the cohort. The weights are generated by taking differences of ordered uniform random numbers. More precisely, if the sample size is $n$, we generate $n - 1$ numbers uniformly distributed on the interval $[0, 1]$, sort them, include the endpoints 0 and 1, and calculate the differences of consecutive numbers. The donor $j$ is randomly chosen from the set $Q$ with probabilities proportional to the weights $w_i$. The imputation for the missing event time $X$ will be $x_j$ and the covariates $z$ are replaced by the covariates $z_j$. The same weights are used for all left censored observations in one imputation round. The imputation transforms the data into right censored survival data and the standard analysis methods for such data are applicable. Imputation modifies the follow-up period to start from the age $b_{\min} = \min_i b_i$ for all subjects. After each imputation round, we fit a survival model to the imputed data without the bootstrap weights and store the estimates $\hat{\beta}_k$, where $k$ is the number of the imputation round. The number of imputation rounds can be as small as $K = 5$ but moderate values such as $K = 20$ are preferable if we are also interested in estimating the variance of $\hat{\beta}$ reliably. After all imputation rounds, the estimates $\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_K$ are combined as
$$\bar{\beta} = \frac{1}{K} \sum_{k=1}^{K} \hat{\beta}_k, \qquad (2.1)$$
$$\widehat{\mathrm{Var}}(\bar{\beta}) = \frac{1}{K} \sum_{k=1}^{K} \widehat{\mathrm{Var}}(\hat{\beta}_k) + \left(1 + \frac{1}{K}\right) \frac{1}{K - 1} \sum_{k=1}^{K} (\hat{\beta}_k - \bar{\beta})^2, \qquad (2.2)$$
where the combined variance is the sum of the within-imputation variance and the between-imputation variance. The formulae for the combined mean and variance are the same as those routinely used for multiple imputation of missing data (Rubin, 1987).
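The weight generation and the donor selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the data layout (observed non-fatal event times stored as numbers, `None` for all other subjects) and the strict inequality used to define eligible donors are assumptions made for the example.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Differences of n - 1 ordered Uniform(0, 1) numbers with endpoints 0 and 1."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    points = [0.0] + cuts + [1.0]
    return [hi - lo for lo, hi in zip(points, points[1:])]


def impute_left_censored(b, event_times, covariates, weights, rng):
    """Draw a donor among subjects with an observed non-fatal event before age b.

    event_times[i] is the observed event time of subject i, or None if the
    subject has no observed non-fatal event (assumed data layout).
    Returns (imputed event time, imputed covariates), or None if no donor exists.
    """
    donors = [i for i, x in enumerate(event_times) if x is not None and x < b]
    if not donors:
        return None
    # Sampling probabilities are proportional to the bootstrap weights.
    j = rng.choices(donors, weights=[weights[i] for i in donors], k=1)[0]
    return event_times[j], covariates[j]


rng = random.Random(2024)                 # fixed seed for reproducibility
times = [40.0, None, 55.0, 48.0, None]    # hypothetical observed event times
covs = [{"bmi": 27}, {"bmi": 24}, {"bmi": 31}, {"bmi": 29}, {"bmi": 22}]
w = bayesian_bootstrap_weights(len(times), rng)
print(impute_left_censored(50.0, times, covs, w, rng))
```

The same weight vector `w` would be reused for every left censored subject within one imputation round, and a new vector would be generated at the start of the next round.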
The imputation needs to be stratified by all relevant permanent covariates. This is done simply by performing the imputation independently for each subgroup defined by the strata. Stratification by continuous covariates requires that they are suitably categorized. If the sample size is small, it is necessary to keep the number of subgroups small in order to guarantee that there are eligible donors in each subgroup.
The use of hot deck imputation implies that the imputed event times cannot be smaller than the smallest observed non-fatal event time $x_{\min} = \min \{x_i : i \in R_1\}$ in the data. Therefore, successful imputation requires that the observed event times also include young subjects. If the data contain left censored observations with $t < x_{\min}$, we recommend that they are excluded from the analysis. A large number of such observations in a data set would indicate that the NPMI is not an appropriate analysis method for the data set.

NPMI with fatal and non-fatal events
Next we consider a more realistic situation where some events are fatal and the follow-up times are shorter, implying that some subjects are not followed up to the age $b_{\max}$. The imputation procedure for left censored observations is essentially the same as in Section 2.2 but the expected remaining lifetimes must be estimated for the subjects who are withdrawn alive from the study. In addition to the notation defined above, we use $D$ to indicate the time of death. Our cohort is sampled from the population that is alive at baseline, i.e. it always holds that $D \ge T$. The right censoring now has two possible reasons: death, when $D = C$, or the end of follow-up for any other reason, when $D > C$. For each subject, age at baseline $b_i$, censoring time $c_i$ and the time of death $d_i$ (if it is not after $c_i$) are recorded in addition to the observed time $t_i$ and the type of event. If a subject was withdrawn alive from the study, the time of death is not known but the vital status at the end of follow-up is still known. For the estimation, we need a working assumption that, given the age at censoring, the time from the first event does not have an impact on the remaining lifetime. Statistically, $D - C$ and $C - X$ are assumed to be independent given $C$. The data can be used to estimate the probabilities of living an additional year on the condition that an event has occurred in the past (equation 2.3). Estimated probabilities for survival of $v$ additional years are obtained by chain calculations (equation 2.4). The imputation of left censored observations is carried out similarly to the simplified situation but the sampling probabilities for selecting the donor $j$ are proportional to the product of the weight $w_i$ and the estimated survival probability $P(\lfloor c_i \rfloor, \lfloor b \rfloor)$, where the notation $\lfloor \cdot \rfloor$ refers to the full years of the age.
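The chain calculation of multi-year survival probabilities can be sketched as a product of one-year probabilities. The sketch below assumes that the one-year survival probabilities have already been estimated and are stored in a dictionary keyed by integer age; the function name `survival_probability` and the numerical values are hypothetical.

```python
import math


def survival_probability(one_year, age_from, age_to):
    """Chain product of one-year survival probabilities between two ages.

    one_year[a] is the estimated probability that a subject with a past event
    who is alive at age a survives to age a + 1 (assumed to be given).
    Ages are truncated to full years; the result is 1.0 if age_to <= age_from.
    """
    start, end = math.floor(age_from), math.floor(age_to)
    prob = 1.0
    for age in range(start, end):
        prob *= one_year[age]
    return prob


# Hypothetical one-year survival probabilities for ages 60 to 62.
one_year = {60: 0.9, 61: 0.8, 62: 0.5}

# Donor withdrawn alive at age 60.7, left censored subject with baseline age 62.2.
p = survival_probability(one_year, 60.7, 62.2)   # 0.9 * 0.8
w_i = 0.05                                       # bootstrap weight of the donor
print(p, w_i * p)                                # unnormalized sampling weight
```

The product of the bootstrap weight and this survival probability gives the unnormalized sampling weight of a donor who was withdrawn alive before the baseline age of the subject to be imputed.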

Lexis diagram imputation
When analyzing real world cohorts we also have to take into account that there are subjects who are excluded from the cohort due to a fatal CHD event or death for another reason prior to the baseline examination and are therefore completely unknown (left truncation). In this paper we consider a novel approach where the missing subjects are imputed to the data. The approach can be motivated by a Lexis diagram (Keiding, 1998) and is therefore called Lexis diagram imputation. An illustration of the idea can be seen in Figure 2. The left panel of Figure 2 presents deaths that are observed during follow-up and deaths that are not observed because they occurred before the baseline examination. The three types of unobserved deaths correspond to the sets $R_4$, $R_5$ and $R_6$, and the three types of observed deaths correspond to the sets $R_2$, $R_{D13}$ and $R_{D0}$. The right panel of Figure 2 shows the geometry of the Lexis diagram. We observe the deaths inside the follow-up parallelogram ABCD and use them to impute the deaths that occurred in the triangle ADE. In other words, we create a cohort where the follow-up for deaths starts from the age of 25 years for everyone and subjects who in reality did not survive until the actual baseline are represented by the imputed subjects. In the right panel of Figure 2 we have an observed death in year 1995 at age $d_i = 50$, which is represented by the point on the line segment PG. For this age, the line segment PG represents the 10 years of follow-up time and the line segment PF = PA represents $65 - 50 = 15$ years of unobserved time. Assuming that the death rates have not changed in calendar time, the expected number of unobserved deaths corresponding to the observed death is the ratio of PF to PG, which equals 15/10 = 1.5. The estimated expected number of deaths in the triangle ADE is the sum of these ratios over all observed deaths and the number of deaths to be imputed follows the Poisson distribution with this sum as the mean parameter.
The subjects who died during the follow-up are donors in the Lexis diagram imputation. The imputed subject receives the age at death as well as the type of death from the donor. Taking the Bayesian bootstrap weights into account, the weight $\eta_i$ of donor $i \in R_D$ in the Lexis diagram imputation is the bootstrap weight $w_i$ multiplied by the ratio of the unobserved time to the follow-up time, where $R_D$ is the set of subjects who died during the follow-up and $l$ is the length of the follow-up period. The number of deaths to be imputed, $N_{456}$, is generated from the Poisson distribution with the mean parameter $N \sum_{i \in R_D} \eta_i$. Then $N_{456}$ imputations are drawn from $R_D$ using sampling probabilities proportional to $\eta_i$. Note that this is equivalent to performing the imputation separately for $R_4$, $R_5$ and $R_6$.
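A minimal sketch of Lexis diagram imputation is given below. It rests on assumptions that go beyond the text: the unobserved time of a donor who died at age $d_i$ is taken to be $\max(b_{\max} - d_i, 0)$, in line with the example where $65 - 50 = 15$; every donor is assumed to have the same follow-up length $l$; and all function and variable names are hypothetical. The Poisson variate is generated by counting arrivals of a unit-rate process, a standard construction used here only to keep the example within the Python standard library.

```python
import random


def unobserved_ratio(death_age, b_max, follow_up):
    """Unobserved time divided by follow-up time for one observed death (assumed form)."""
    return max(b_max - death_age, 0.0) / follow_up


def poisson(mean, rng):
    """Poisson variate: number of unit-rate exponential arrivals in [0, mean]."""
    if mean <= 0:
        return 0
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def lexis_impute(donors, weights, n_cohort, b_max, follow_up, rng):
    """Impute left truncated deaths from donors given as (age at death, type of death).

    weights[i] is the Bayesian bootstrap weight of donor i; n_cohort is the
    cohort size N used to scale the Poisson mean.
    """
    eta = [w * unobserved_ratio(d[0], b_max, follow_up)
           for d, w in zip(donors, weights)]
    total = sum(eta)
    if total <= 0:
        return []
    n_imputed = poisson(n_cohort * total, rng)
    # Each imputed subject receives the age and type of death of its donor.
    return rng.choices(donors, weights=eta, k=n_imputed)


rng = random.Random(7)
donors = [(50.0, "CHD"), (70.0, "other"), (40.0, "CHD")]   # hypothetical deaths
print(unobserved_ratio(50.0, 65.0, 10.0))                   # ratio PF / PG
print(lexis_impute(donors, [1 / 3] * 3, 3, 65.0, 10.0, rng))
```

With equal weights $1/N$ the Poisson mean reduces to the plain sum of the ratios over the observed deaths, which matches the description of the expected number of deaths in the triangle ADE.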

NPMI procedure
After imputation, the log-likelihood function of the data may be presented in a form where $\hat{x}_i$ stands for the imputed event times and $\hat{g}_i$ and $\hat{z}_i$ stand for imputed permanent and time-varying covariates, respectively. Note that the choice of the survival model is independent of the imputation.
The whole procedure of the NPMI has the following steps:
1. Identify the covariates used as strata and perform imputation independently for each subgroup.
2. Calculate the survival probabilities as in equations 2.3 and 2.4.
3. At the beginning of each imputation round, generate weights w i for all observations according to the Bayesian bootstrap.
4. Sample a donor for each subject with left censored event time.

5. Apply Lexis diagram imputation to draw a random sample compensating for the left truncation.
6. Fit a survival model to imputed data without the bootstrap weights and store the estimates.
7. After a suitable number of imputation rounds, combine the estimates using equations 2.1 and 2.2.
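The final combining step can be sketched with the standard multiple imputation rules of Rubin (1987) for a single scalar parameter. The function name `combine_rubin` and the input values are illustrative assumptions; the estimates and their variances would in practice come from the survival model fitted in each imputation round.

```python
def combine_rubin(estimates, variances):
    """Combine K point estimates and their variance estimates (Rubin's rules).

    Returns the combined estimate and the combined variance, i.e. the
    within-imputation variance plus (1 + 1/K) times the between-imputation variance.
    """
    k = len(estimates)
    if k < 2:
        raise ValueError("at least two imputation rounds are needed")
    mean = sum(estimates) / k
    within = sum(variances) / k
    between = sum((e - mean) ** 2 for e in estimates) / (k - 1)
    return mean, within + (1.0 + 1.0 / k) * between


# Hypothetical estimates from K = 3 imputation rounds.
beta, var = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(beta, var)
```

For a vector of regression coefficients the same rule would be applied to each coefficient separately, or in matrix form to the covariance matrices.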

Simulation Example
The performance of the NPMI approach is studied in simulations. We consider a cohort of size $n_0 + n_1 + n_2 + n_3 + n_4 = 3000$. The age at baseline $B_i$ is generated from the Uniform(30, 65) distribution. The event times $X_i$ are generated from the Weibull distribution with shape parameter 8. An event is fatal with probability 0.3 and non-fatal otherwise. There are no competing causes of death. The mean event time for zero covariates is set to be $m = 65$ or $m = 80$ years and the length of the follow-up is $l = 5$ or $l = 10$ years. One thousand simulation runs are generated for each combination of mean event time and length of follow-up.
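The data generating mechanism for zero covariates can be sketched as follows. This is an illustration under assumptions rather than the simulation code of the study: covariate effects are omitted, every subject is assumed to be followed for the full length $l$, and the names `weibull_scale` and `simulate_cohort` are hypothetical. The Weibull scale is chosen so that the mean event time equals $m$, using the fact that the mean of a Weibull distribution is the scale times $\Gamma(1 + 1/\text{shape})$.

```python
import math
import random


def weibull_scale(mean, shape):
    """Scale parameter giving a Weibull distribution with the requested mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def simulate_cohort(n, mean_event_time, follow_up, rng, shape=8.0, p_fatal=0.3):
    """Simulate n subjects and count them by observation type R0, ..., R4."""
    scale = weibull_scale(mean_event_time, shape)
    counts = {"R0": 0, "R1": 0, "R2": 0, "R3": 0, "R4": 0}
    for _ in range(n):
        b = rng.uniform(30.0, 65.0)            # age at baseline
        x = rng.weibullvariate(scale, shape)   # event time (scale, shape)
        fatal = rng.random() < p_fatal
        c = b + follow_up                      # end of follow-up (assumed full length)
        if x <= b:
            counts["R4" if fatal else "R3"] += 1   # truncated or left censored
        elif x <= c:
            counts["R2" if fatal else "R1"] += 1   # fatal or non-fatal observed event
        else:
            counts["R0"] += 1                      # right censored
    return counts


rng = random.Random(123)
print(simulate_cohort(3000, 65.0, 10.0, rng))
```

Subjects counted in `R4` would be dropped before analysis, since a fatal event before baseline makes them unobservable; they are counted here only to show the amount of left truncation in a simulated cohort.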
For the modeling we use Cox's proportional hazards model (Cox, 1972)
$$\lambda(t) = \lambda_0(t) \exp(\beta_1 z_{1i} + \beta_2 z_{2i} + \alpha g_i), \qquad (3.1)$$
where $\lambda(t)$ is the hazard rate, $\lambda_0(t)$ is the baseline hazard, $z_{1i}$, $z_{2i}$ and $g_i$ denote the covariates of the $i$th subject and $\beta_1$, $\beta_2$ and $\alpha$ represent the model coefficients to be estimated. Comparisons are made between the NPMI with 20 imputation rounds and the exclusion (Excl.) approach where the left censored observations are ignored and the regression model is estimated from the rest of the data, which are right censored. Besides the estimated model parameters, we also compare the imputed and the true distributions of left censored event times and the imputed and the true values of covariates.
The distribution of the imputed event times is studied in Figure 3. It can be seen that the empirical cumulative distribution function (cdf) of imputed event times closely resembles the empirical cdf of true left censored event times. The small difference between the curves at the smallest event times can be explained by the exclusion of subjects when the left censored event time is smaller than the smallest event time during the follow-up, i.e. $b_i < x_{\min}$, $i \in R_3$. The number of these subjects was small, five or less in 95% of simulation runs. Instead of excluding these subjects we also tried using the measured covariates and the age at baseline examination as the event time, but the results were better with exclusion. It can also be seen from Figure 3 that the distribution of event times during the follow-up clearly differs from the distribution of left censored event times. The distribution of an imputed covariate is studied in Figure 4. The distributions are clearly similar and the distribution of covariates of subjects with an event during the follow-up is only slightly different. We conclude from Figures 3 and 4 that the imputed event times and covariates are unbiased or almost unbiased.
Estimated parameters of model 3.1 are presented in Table 2 for different simulation settings. Two versions of the NPMI are presented: the NPMI-N does not compensate for left truncation whereas the NPMI-LX uses Lexis diagram imputation for left truncated observations. It can be seen that the NPMI-N produced biased estimates in all simulation experiments, as expected. For the NPMI-LX and the Excl. it seems that there is some bias when the number of events is small (Simulation B) but the bias decreases when the number of events increases, and in Simulation C both the NPMI-LX and the Excl. produced unbiased estimates. We conclude that left truncation may have a significant effect on the estimates and recommend using the NPMI only with compensation for left truncation.
The accuracy of the estimators was measured by the root mean square errors (RMSE) of the estimates $\hat{\beta}_{\text{method}}$, where $\hat{\beta}_{\text{method}}$ is one of the following: $\hat{\beta}_{\text{NPMI-N}}$, $\hat{\beta}_{\text{NPMI-LX}}$, $\hat{\beta}_{\text{Excl.}}$. The RMSEs are compared to the square root of the average of variance estimates from 2000 simulation runs. It can be seen that the square roots of the estimated variances are close to the RMSEs. Comparison between the NPMI and the Excl. reveals that the Excl. resulted in slightly smaller RMSEs for parameters $\beta_1$ and $\beta_2$ but for parameter $\alpha$ the NPMI was clearly better in terms of RMSEs. It also seems that the difference in RMSEs for parameters $\beta_1$ and $\beta_2$ between the NPMI and the Excl. decreases when the number of events increases. In Simulation C, the RMSEs of $\beta_1$ and $\beta_2$ were almost the same for the NPMI and the Excl. The result can be understood if we consider the amount of information available in the NPMI and the Excl. Covariate $G$ is observed for all subjects and consequently exclusion of observations in the Excl. increases the variance of parameter $\alpha$. On the other hand, covariates $Z_1$ and $Z_2$ are missing for left censored subjects and the imputed covariates contain the same amount of information as the non-imputed covariates. A small number of events handicaps the NPMI but when the number of events is sufficiently large the variances of the estimated effects of covariates $Z_1$ and $Z_2$ should be the same for the NPMI and the Excl.

Example with Real Data
To test the NPMI with real world data we consider the FINRISK cohorts (Kulathinal et al., 2005; Vartiainen et al., 2000) that are a part of the MORGAM Project. The baseline examinations of the cohorts were in 1982, 1987, 1992 and 1997, and all cohorts were followed up to the end of year 2001 except the 1997 cohorts, which were followed up to the end of year 2003. In our example, an event is defined as the first incidence of CHD (fatal or non-fatal) and the follow-up is set to end at the age of 75 years at the latest. The numbers of different events for each cohort are summarized in Table 3. The age at baseline varies from 25 years to 64 years (the upper limit was 74 years in the 1997 cohorts but we exclude subjects over 64 years at baseline). The NPMI is suited for these data because there are practically no CHD events before the age of 25 years. Cox's proportional hazards model explaining the event times by classic CHD risk factors is fitted to the data. Age of the subject is used as the time scale. The covariates in the model are the ratio of total to high-density lipoprotein (HDL) cholesterol (MORGAM variable RCHOL), mean of systolic and diastolic blood pressure (BPM), body mass index (BMI), daily smoking (yes or no, DSMOKER) and family history of CHD (defined as the answer to the question: "Has your father had any of the following diseases before the age of 60 years: myocardial infarction or angina pectoris", FHISCHD). The model is fitted separately for men and women and is stratified by the cohort baseline year and the geographical region (East or West).
We compare two approaches for the left censored event times: the NPMI and the Excl. In the Excl. approach 486 men and 168 women are removed from the analysis due to a CHD event prior to baseline. An additional 187 men and 127 women are removed because of stroke prior to baseline and 7 men and 6 women are removed because of missing covariate measurements (mainly missing BMI). Further, 15 men and 5 women are excluded because of very high RCHOL values (> 15). High values of the cholesterol ratio indicate that total cholesterol is very high and/or HDL cholesterol is very low. Both very high total cholesterol and very low HDL cholesterol are associated with high CHD risk but are best handled as special cases. We are interested in modeling the risk for RCHOL values widely represented in our cohorts and outlying values may have an undesirable impact on the parameter estimates. Missing values of covariates DSMOKER and FHISCHD are combined with the category 'no'.
In the NPMI, left censored event times and covariates RCHOL, BPM, BMI and DSMOKER are imputed. Imputation is stratified by sex, family history (FHISCHD), the year of baseline and the region. Family history (FHISCHD) is taken as a permanent covariate although in principle there is a chance of a change from 'no' to 'yes' as time passes. The 187 men and 127 women who had stroke but not myocardial infarction prior to baseline were excluded from the imputation and the analysis. The same exclusion criteria for missing covariates and for very high RCHOL values as in the Excl. approach were used.
The number of imputation rounds is 20. The sunflower plot in Figure 5 presents five imputed event times for each left censored observation. There are relatively few observed events for younger subjects, which causes the same donor to be used several times and is seen as overlapping points in the plot. Figure 5 also illustrates the difference in CHD incidence between men and women. The distribution of the imputed covariates for $R_3 \cup R_4 \cup R_5$ is shown in Figure 6. The distribution of the covariates of the subjects with an event during the follow-up is plotted for comparison. For men the distributions are rather similar but for women there are differences, especially in BMI. The difference or equality of the distributions does not tell about the performance of the imputation but reflects the changes in the covariates as a function of age. The parameter estimates from Cox's proportional hazards model are summarized in Table 4. Covariates BPM, RCHOL, DSMOKER and FHISCHD have a statistically significant effect at the 95 % confidence level in all models. BMI is significant for men. These results are in agreement with the previous knowledge of these associations. The variance of the NPMI estimate of the permanent covariate FHISCHD is smaller than the variance of the Excl. estimate, as we could expect on the basis of the simulation example. The variances of the other covariates are in general slightly smaller for the Excl. estimates, which is also in agreement with the simulation results. There are some differences between the NPMI estimates and the Excl. estimates that might not be completely explicable by random variation. We do not have a comprehensive explanation for the differences but the potential explanations include non-proportional hazards, changes in the covariates over time and an unbalanced cohort due to non-response. The detailed analysis of these data will be carried out as a part of the general MORGAM analysis plan.
Figure 7 illustrates the relative importance of the covariates in the cohort. The values of each covariate are sorted in ascending order and the relative hazard compared to the median of the covariate is plotted as a function of the cumulative covariate distribution. This gives insight into the epidemiological significance of the covariates and makes it possible to compare covariates measured on different scales. According to Figure 7, RCHOL and DSMOKER seem to be the most important covariates in the FINRISK cohorts. The differences between the NPMI and the Excl. estimates of DSMOKER, FHISCHD and BPM are also visible in Figure 7.

Conclusion
In this paper, we considered the estimation of regression models from left and right censored survival data and proposed the NPMI approach that converts the left and right censored data into multiple right censored data sets. The left truncation due to deaths prior to baseline is compensated by Lexis diagram imputation. The imputation of left censored data is done without reference to the underlying distribution or model of the event time and hence the procedure can be applied to more general models than Cox's proportional hazards model. In simulations it was found that the distributions of the imputed event times and the imputed covariates are very close to the true distributions. Good performance was also observed when analyzing real world data from the FINRISK cohorts. The NPMI is specifically designed for the data arising from the MORGAM Project but the approach may be applicable to other studies as well. The main requirement for the NPMI approach is the existence of prospective follow-up data for all relevant ages. In the genetic sub-study of MORGAM, the NPMI needs to be adapted to the case-cohort design. This does not require any changes to the imputation itself.
Compared to the Excl., the main benefit of the NPMI is the gain in efficiency when estimating the effect of permanent covariates. This has practical importance in the MORGAM Project where one of the main goals is the testing of candidate genes. The primary interest is then in the statistical significance of the candidate genes and the classic risk factors in the model have secondary importance. Compared to parametric imputation, the NPMI is robust against imputation model misspecification. Compared to full likelihood alternatives (e.g. the EM algorithm or Bayesian methods) the benefits of the NPMI are speed and straightforward implementation (standard methods and software may be used). In fact, we are not aware of any practical full likelihood based approach that would be directly applicable to the described setup with left truncation. The drawbacks of the NPMI are the need for multiple analyses and a small loss of efficiency compared to the Excl. when estimating the effect of imputed covariates from data with a small number of events. A small bias might also be unavoidable if the number of events is small.
The results in this paper suggest that the inclusion of left censored observations without compensating for left truncation leads to biased estimates. This conclusion is not restricted to the NPMI but applies to all analyses where the events may be fatal or non-fatal and the follow-up is modified to start from a time prior to the recruitment of the cohort. The bias, however, is not necessarily large compared to the standard errors of estimates in moderately sized cohort studies.

Figure 1: Illustration of a study design leading to left and right censored data with left truncation. The Lexis diagram of a cohort study is displayed. The follow-up period is from the year 1992 to the year 2001 and the age of the subjects is 25-65 years at the baseline examination. The following variables are presented: B = age at baseline examination, X = time of first CHD event, C = censoring time, T = observed time and D = time of death. In the diagram, the data of eight subjects are presented. Two subjects have an event observed during the follow-up (X = 65 and X = 53). One of the events is fatal (X = 53 and D = 53) and the other is non-fatal (X = 65 and D = 68). One subject is right censored (C = 39). Two subjects have a left censored event (X = 48 and X = 36). At the baseline examination, the existence of a left censored event is recorded but the exact time of the event remains unknown. One of the subjects with a left censored event dies during the follow-up period (D = 45); the other survives up to the end of follow-up (C = 64). Three subjects are completely unobserved (D = 56, D = 49 and D = 34). One of them had a fatal CHD event (X = 34 and D = 34), one had a non-fatal event (X = 50) and died later (D = 56), and one died (D = 49) without a preceding CHD event.
Figure 2: An illustration of Lexis diagram imputation. The left panel presents deaths that are observed during follow-up and deaths that are not observed because they occurred before the baseline examination in 1992. The right panel shows the geometry of the Lexis diagram. We are interested in all deaths inside the polygon BCDE but observe only the deaths inside the parallelogram ABCD. The deaths in the triangle ADE are imputed using the observed deaths. The ratio of line segments PF and PG gives the expected number of unobserved deaths corresponding to a death at age $d_i$. For example, we have an observed death in year 1995 at age $d_i = 50$, which is represented by the point on the line segment PG. For this age, the line segment PG represents the 10 years of follow-up time and the line segment PF = PA represents 65 − 50 = 15 years of unobserved time.
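The ratio in the caption can be checked with a short calculation. The sketch below is a minimal illustration that covers only the situation of the caption's example, that is, a death age for which the full 10-year follow-up window is observed (PG = 10). The function name and its parameters are hypothetical, and the general geometry of the polygon BCDE is not reproduced.

```python
def expected_unobserved_deaths(death_age, max_baseline_age=65.0, follow_up_years=10.0):
    """Ratio PF / PG for one observed death (illustrative sketch).

    PF = unobserved time before baseline at this age (max_baseline_age - death_age).
    PG = observed follow-up time at this age (assumed to be the full window).
    """
    unobserved_years = max(0.0, max_baseline_age - death_age)  # PF
    return unobserved_years / follow_up_years                  # PF / PG


# Example from the caption: death at age 50, PF = 15 years, PG = 10 years
ratio = expected_unobserved_deaths(50)
```

Under these assumptions, the observed death at age 50 corresponds to 15 / 10 = 1.5 expected unobserved deaths in the triangle ADE.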

Figure 3: Imputed and true distributions of left censored event times in the simulation example. The left panel presents empirical cdfs in a typical realization and the right panel shows empirical cdfs calculated from 100 simulation runs. The solid line represents the NPMI and the dashed line represents the true event times. For comparison, event times observed during the follow-up are also plotted (dotted line). The results are from Simulation B; the results from the other simulations are essentially similar.

Figure 5: Imputed event times in FINRISK. The number of sunflower leaves is equal to the number of multiple observations.

Figure 6: Distribution of imputed covariates (solid line) and distribution of covariates of subjects with an event during the follow-up (dotted line) in FINRISK.

Figure 7: Relative hazard as a function of ordered covariates. Relative hazard is calculated with respect to the median of the covariate, which is displayed in the legend. Cumulative probability is used on the x-axis instead of the actual covariate values so that all covariates can be displayed in the same plot.

Table 1: Different types of observations. The variables in the table are: time of event $x_i$, age at baseline $b_i$, censoring time $c_i$ and time of death

Table 2: Results of the simulation example. Means of the estimates are reported together with RMSE 3.2 and the square roots of the average variances estimated from the model. The estimation methods in the comparison are the NPMI without compensation for left truncation (NPMI-N), the NPMI with the Lexis diagram imputation (NPMI-LX) and exclusion of left censored observations (Excl.). The sample size is 3000 (including left truncated subjects) and the reported numbers are means from 2000 experiments. Simulation parameter $m$ is the mean event time for zero covariates and parameter $l$ is the length of the follow-up. Numbers $n_1$, $n_2$ and $n_3$ indicate the mean number of non-fatal events, fatal events and left censored events, respectively.

Table 3: Summary of the FINRISK cohorts used in the example. $n$ is the total number of subjects, $n_1$ and $n_2$ indicate the number of non-fatal and fatal events during the follow-up and $n_3$ is the number of left censored events.

Table 4: Estimated covariate effects for the FINRISK data. Estimates and their standard errors (se) are presented together with relative hazard estimates (exp(estim.)) and their 95% confidence intervals.