Application and Comparison of Methods for Analysing Correlated Interval-censored Data from Sexual Partnerships

In epidemiological studies where subjects are seen periodically on follow-up visits, interval-censored data occur naturally. The exact time the change of state (such as HIV seroconversion) occurs is not known exactly, only that it occurred sometime within a specific time interval. This paper considers estimation of parameters when HIV infection times are intervalcensored and correlated. It is assumed that each sexual partnership has a specific unobservable random effect that induces association between infection times. Parameters are estimated using the expectation-maximization algorithm and the Gibbs sampler. The results from the two methods are compared. Both methods yield fixed effects and baseline hazard estimates that are comparable. However, standard errors and frailty variance estimates are underestimated in the expectation-maximization algorithm compared to those from the Gibbs sampler. The Gibbs sampler is considered a plausible alternative to the expectation-maximization algorithm.


Introduction
Interval-censored data arise in research settings where the exact time an event occurs is not observed directly, but only the time interval to which the observation belongs is observed.For instance, the exact time HIV seroconversion occurs is not observed exactly (Jewell, et al. 1994) but only the clinical examination times between which HIV infection occurred.Estimation methods for intervalcensored data are often based on the Cox proportional hazards model (Cox, 1972).Finkelstein (1986) generalized the Cox proportional hazards model to account for interval-censored data.Huang and Wellner (1997) provide a rigorous theoretical account for maximum likelihood methods for interval-censored data.
Dependency of event times further complicates estimation in interval-censored data.Dependency may arise as a result of the sampling method used, such as in the study of HIV seroconversion among cohorts of circular migrant men, nonmigrant men and their non-migrant sexual partners as described in Lurie et al. (2003a;2003b), and is the subject of our analysis, see Section 2. This dependency is often modelled as random effects or frailties.Frailty is the term describing common excess risk of infection among members of the same sub-group.Frailties are considered unobserved mutually independent random variables specified by some parametric distribution.The topic of frailty models has received considerable attention in demography (Vaupel, et al. 1987) and statistics (Clayton, 1978;Clayton and Cuzick, 1985).Including frailties in the interval-censored data likelihoods of (Finkelstein, 1986) or (Huang and Wellner, 1997) results in complex intractable likelihood functions.The conjugate gamma frailty distribution often assumed in standard survival frailty model (Klein, 1992) is no longer conjugate in the interval-censored likelihood (Finkelstein, 1986;Huang and Wellner, 1997).
In this paper, both the interval-censored infection time and frailties are treated as missing data.The primary goal of this paper is to apply and compare two statistical methods of analysing correlated interval-censored data.The two statistical methods considered are the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) and Bayesian analysis using Markov chain Monte Carlo (MCMC) methods (Gilks, et al. 1996;Carlin and Louis, 1996).The primary outcome variable is the survival time from 1990 until HIV infection, or until the end of the study or until the subject was lost to follow-up.In South Africa, research shows that the epidemic of HIV well established in 1990 (Gouws and Williams, 2000 and references therein) and thus 1990 is used as the initial time.The main focus is on full Bayesian analysis and its comparison to maximum likelihood estimation.Section 3 of this paper presents the migration data to be analyzed.The assumed conditional survival model is presented in Section 3. Section 4 presents the likelihood formulation for the EM estimation.Full Bayesian estimation is presented in Section 5.The application and conclusion are presented in Sections 6 and 7, respectively.

The Data
A total of 631 individuals aged between 18 and 60 years were recruited into the study.The study composed of circular migrant men, non-migrant men and their rural based non-migrant sexual partners.Circular migration is the predominant type of migration in South Africa where young men migrate to work in urban areas leaving their rural sexual partners behind, and return home periodically.Circular migrant men were recruited from their workplaces in urban areas.They provided details of their sexual partners residing in rural areas, who were then located and invited to participate.In the neighbourhood of each migrant man's rural household, a non-migrant man and his partner(s) were selected and invited to participate.The study participants were visited approximately every four months to administer a detailed questionnaire eliciting information related to  (Lurie, et al. 2003a;2003b).The current analysis is restricted to 339 identifiable distinct sexual partnerships from 604 individuals.The mean sexual partnership size is 1.78 individuals.Sexual partnership size ranges from 1 to 5 individuals with only one man in each sexual partnership.Table 1 shows the distribution of sexual partnerships and percentage of persons infected with HIV.Considerable numbers of migrant men gave incorrect information about the location of their partners and some of their identified partners refused to participate resulting in cases where only a man was included making up to 36% of five different number of included participants in a partnership.Fifty-two percent of sexual partnerships were couples.
Men (mostly migrants) whose partners were not part of the study, contributed considerably to high HIV infection in sexual partnerships where only one partner was included.HIV infection was considerably higher in triads than couples.There are small proportions of sexual partnerships of size greater than three to make valid comparisons.The maximum total number of HIV infected members per sexual partnership size was three, and was among triads.The mean age at first sexual intercourse was 18 and 17 years for men and women, respectively.The mean number of lifetime partners was 15.8 and 2.0 for men and women, respectively.Men tend to overstate their sexual behaviour whilst women understate theirs.

The Conditional Survival Model
The data analyzed is clustered within sexual partnerships.The HIV infection time is interval-censored such that it is unobserved but only the time interval within which HIV infection occurs is observed.
where L ij is the last time a person tested HIV negative and U ij is the first time a person tested HIV positive.For right-censored observation, the observed time is L ij .For notational simplicity, let Y ij = (L ij , t ij ) denote both observed right-censored time L ij and unobserved infection time t ij .Define a non-censoring indicator δ ij = 1 if HIV positive and 0 otherwise.The ith sexual partnership frailty is denoted by b i .
The multiplicative frailty model is assumed.Baseline hazards are assumed constant λ 0 (y ij ) = λ 0 and the corresponding integrated baseline hazard is Λ 0 (y ij ) = λ 0 y ij .Throughout this paper, it is assumed that censoring and infection times are independent.Therefore, the censoring process is non-informative.Conditional on b i , survival times are independent and their conditional hazards distribution is where X ij is the vector of covariates and β represents the corresponding covariate effect.Those infected with HIV contribute to the likelihood the product of their conditional hazards and conditional survival function whilst those who were rightcensored contribute only the conditional survival function.The conditional sur- The frailties act multiplicatively on the baseline hazard and are interpreted as relative risks (RR)s.They are unobserved and take only positive values.In this work, they are modelled as independent random variates from a Gamma(α, α) distribution.The RR for sexual partnerships has mean 1 and variance 1/α in this case.The unit mean constrain ensures that the sexual partnership effects represent deviations from the population average risk.The gamma distribution is a popular choice for frailties, possibly due to its flexible shape and conjugacy property (Guo and Rodriguez, 1992;Klein, 1992;Bolstad and Manda, 2001).Heckman and Singer (1984) and Pickles and Crouchley (1995) discuss problems associated with the choice of the frailty distribution.

Maximum Likelihood Estimation
This section examines estimation of fixed effect parameters β, baseline hazard λ 0 and association parameter α from the Gamma frailty distribution using maximum likelihood approach.To do this, we need the joint distribution of the response vector and frailties.The response vector y i of sexual partnership i consists of (possibly sub-vectors) observed v i and unobserved t i .Using conditional independence between y i given b i , the complete-data likelihood contribution for sexual partnership i is where θ = {α, λ 0 , β}.The complete-data log-likelihood for sexual partnership i corresponding to (4.1) is The complete-data log-likelihood (4.2) depends on functions of unobserved infection time (t ij ) and sexual partnership specific frailty (b i ).Implementation of the EM algorithm requires calculation of Q(θ; θ (r) ) equal to the conditional expectation of (4.2) over all functions of unobserved data, given the observed data and current estimate θ (r) of θ.The observed data likelihood L i (v i ; θ) is attained by integrating out unobserved data from (4.1) as follows: where δ i+ = J i i=1 δ ij is the total number of HIV positive members of a particular sexual partnership network.The number of people infected with HIV in a particular sexual partnership is known at the analysis stage and thus δ i+ .
It can be seen from ( 4.2) that we need only the conditional expectations of b i , log b i and b i t ij .This is because (4.2) is a linear function of these quantities.The conditional expectations of b i and log b i are computed using the marginal conditional distribution of b i given the observed data.Firstly, the joint marginal distribution f Thus, the marginal conditional distribution of b i is given by where The integrand expands depending on δ i+ .Therefore, the sums of integrals will disappear by making each integrand mimic a gamma density.The EM algorithm proceeds by iteratively computing the following E-and M-steps.
• E-step: The E-step computes the conditional expectations of b i , log b i and b i t ij given the observed data and current parameter estimates of θ.The conditional expectation of b i simplifies whilst that of log b i can be approximated numerically.The conditional expectation is the joint conditional distribution obtained by first integrating out the remaining sexual partnership unobserved infection times t ij , for j = j from L i (b i , v i , t i ; θ).
• M-step: The M-step of the algorithm involves maximizing Q(θ; θ (r) ) after replacing b i , log b i and product b i t ij by their conditional expectations in (4.2).Maximization is accomplished via Newton-Raphson algorithm which requires evaluation of the first and second derivatives of Q(θ; θ (r) ).
The two steps are iterated until convergence criterion is met.The EM algorithm may show slow convergence if there is large amount of missing data, or if the estimated hyperparameter heavily depends on missing data.The rate of convergence in this study was satisfactory and thus complex computations involved in the acceleration procedures were not used (Laird, Lange and Stram, 1987).

The Full Bayesian Estimation
The model framework presented in the preceding section is hierarchical and fully specified from the frequentist point of view and the model parameters have been estimated using the EM algorithm.In addition to this, we need to specify priors for fixed effects β, baseline hazard λ 0 and hyperparameter α before the model is fully specified from a Bayesian perspective.The prior for β is assumed multivariate normal with mean vector d 0 = 0 and diagonal covariance matrix Σ 0 = υ 0 I, where υ 0 is a suitably chosen large number.The prior mean is set to 0 since the fixed effects represent logarithms of RRs and not expected to be far from 0. The effect of these priors on the marginal posteriors of the regression coefficients is almost identical to the flat priors.A Gamma(ξ 0 , ζ 0 ) prior distribution is specified for the baseline hazard.The specification of priors for precision parameter is more difficult in hierarchical model setting.An improper prior can lead to an improper posterior (Hobert and Casella, 1996).Thus, a Gamma (ν 0 , κ 0 ) prior for precision component is often assumed due to its conjugacy status.All prior distributions are assumed independent of each other.
The modelling framework considered here is related to the work of Clayton (1991), Gustafson (1997) and Bolstad and Manda (2001).All these authors discuss Bayesian models for hierarchical multivariate survival data for precisely known failure times.Gustafson (1997) used similar approach in the implementation of Cox partial likelihood.Bolstad and Manda (2001) presented a three-way multilevel model for child mortality.In this work, an important aspect of sampling interval-censored failure times conditional on examination times, frailties and other observed data is considered.Sinha and Dey (1997) reviewed a number of Bayesian methods for analysing survival data and clearly, the extensions of semiparametric Bayesian model for analysis of multivariate survival data using frailty model (Clayton, 1991) to interval-censored data are not immediate.
Figure 1 presents the directed acyclic graph of the model.Each parameter node is circled; the data and prior constants are indicated by rectangles denoting that they are fixed.The joint distribution of all parameters, hyperparameters and the data can be written as the product of all prior and conditional distribution as follows: In Bayesian analysis, the joint posterior distribution of all parameters given the data is required.In our model, the joint posterior distribution cannot be obtained analytically.Instead, we use the Gibbs sampler (Geman and Geman, 1984) to obtain the required posterior.The Gibbs sampler proceeds by iteratively sampling from the conditional posterior distribution of each parameter using the most recent values of the given parameters.This generates a Markov chain in the process and the chain has the joint posterior as its long-run distribution.
In a hierarchical model, the conditional distribution of one node given all the other nodes is proportional to the prior distribution of that node times the conditional distribution of all its direct child nodes and co-parent nodes.The relevant Gibbs conditionals can be computed from the joint distribution (5.1) and some of these conditionals are presented here below: which we recognize as the kernel of a gamma distribution with shape α+ J i j=1 δ ij and inverse scale α . This node can be sampled directly.
which we recognize as the kernel of a gamma distribution with shape 1 and inverse scale b i λ 0 e β X ij .Such a gamma distribution is equivalent to an exponential distribution with parameter b i λ 0 e β X ij .This node can also be sampled directly on condition that the sampled value This full conditional does not simplify to any standard distribution.Methods for sampling from an arbitrary conditional distribution are required.It turns out that the full conditional distribution is a simple log-concave distribution in α and thus can be sampled efficiently using the adaptive-rejection sampling scheme (Gilks and Wild, 1992).
If a flat prior f (β) = 1 is assumed for β, the posterior mode can be replaced by the maximum likelihood estimate β and the log posterior density by the log likelihood function.The normal approximation, with mean and covariance matrix equal to the mode and inverse of the information obtained from the maximum likelihood estimation, can be used in the Metropolis step to generate candidates for β.

Application and Comparison
The possible risk factors of HIV considered in our analysis include migration status, age at recruitment, number of lifetime partners, and number of recent sexual contact partners, syphilis status and status of other STIs.Other STIs refer to the status of any of the following STIs: chlamydia, gonorrhoea, genital discharge and genital sores.These are typical covariates that are considered important determinants of HIV infection.Circular migration is one of the structural factors associated with HIV infection, but the dynamics and complex role of circular migration as a determinant of HIV infection is still a major issue for social science research.Importance of migration as a risk factor lies in the assumption that circular migrant men, whilst away from their partners, engage in risky sexual behaviour with other female sexual partners (Lurie, et al. 1997).During this period of migration, partners of circular migrant men are also as likely to acquire extra sexual partners (Lurie, et al. 1997).Risk factors parallel to the epidemic of HIV such as the number of lifetime partners are considered important determinants of HIV infection due to their cumulative effect.Evidence shows that STIs, both ulcerative and non-ulcerative, facilitate transmission of HIV (Wasserheit, 1992).
The EM algorithm and the Gibbs sampler were implemented on Microsoft Visual C++ Version 6.0.For Bayesian inference, five parallel chains were run from independent starting points for 2n =4 000.All the fixed effects parameters, some random effects, inverse scale, baseline hazard and some infection times were monitored for convergence.Gelman and Rubin's (1992) scale reduction factor and other convergence checks were computed.These convergence checks were satisfactory.The first 2 000 iterations were discarded.Starting from the 4 000th iteration, a further 38 000 values were simulated.Every 100th value was taken resulting in 2 000 nearly independent samples from the joint posterior distribution.
It is worth mentioning that fixed effect sampling scheme involved an EM estimation for maximum likelihood estimates and the calculation of Fisher information matrix.The resulting estimates were used in the proposal density for the Metropolis step.This was computationally intensive and equal sampling of all parameters led to correlated values of b i , t ij , α and λ 0 compared to β values.Thus, the iteration scheme was modified to iterate through b i , t ij , α and λ 0 five times for each draw of β.The modification greatly improved efficiency.The acceptance rate for candidate β was about 54%, which was well within 30% and 70%, the recommended acceptance rate (Raftery and Lewis, 1996).The high acceptance rate indicates that the multivariate normal proposal distribution is a good initial approximation to the actual conditional posterior.
The estimates from both methods are presented in Table 2.The fixed effects estimates are presented on a log scale where no risk is represented by 0. Standard deviations from the EM algorithm were computed using the SEM methods proposed by Meng and Rubin (1991).The estimates of the fixed effects and baseline hazard are similar for all practical purposes to the respective modes obtained from the EM algorithm.However, variance parameter estimates differ markedly between the two methods.The Gibbs sampler provides variance estimates that are larger than those from the EM algorithm.The estimate of sexual partnership frailty variance from the Gibbs sampler is quite large compared to the estimate obtained from the EM algorithm.The posterior median and mean is 0.788 and 0.812, respectively.The 95% credible interval for sexual network frailty variance is (0.614, 1.120).In the EM algorithm, the mode of the sexual partnership frailty variance was estimated to be 0.462.The unfavorable consequence is that if one based inference on the EM algorithm, the resulting confidence intervals would be narrower and more significant.

Data
The Gibbs sampler and the EM algorithm have been implemented on correlated interval-censored data.The paper showed that Bayesian analysis via MCMC is capable of not only incorporating information about frailties and infection time, but also uncertainties about available information.For example, the uncertainty about the true values of variance components is formally incorporated into the analysis through the choice of a plausible prior distribution.
The fixed effects results from the Gibbs sampler are in good agreement with the corresponding posterior modes from the EM algorithm.This agreement is generally expected due to the specified proper prior for fixed effects which is nearly flat in the region near zero (Harville, 1974).However, estimated standard deviations from the likelihood approach are severely biased downwards.The bias reflects the incapability of likelihood approach to correct for variability of unobserved frailties and infection time (Ripatti and Palmgren, 1999).Downward bias in standard deviations observed in the likelihood estimation is highly undesirable because it provides false sense of security for the estimates.The frailty variance estimate from the EM algorithm also shows similar downward bias compared to the estimate from the Gibbs sampler.
The Gibbs sampler has been shown to be a plausible alternative to the EM algorithm in this setting.The Gibbs sampler does not require evaluation of highdimensional integrals as done in the EM algorithm.Estimation via the Gibbs sampler is advantageous in that it is easily extendable to other frailty distributions (Sargent, 1998).Superiority of the Gibbs sampler has also appeared in three-way multilevel hazards model for right-censored data (Manda, 2001).
Overseas Development Assistance.Additional support came from a training grant awarded by the National Institute of Drug Abuse to the Miriam Hospital (Grant 5 T32 DA13911) and National Institute of Mental Health Career Development Award granted to Dr. Lurie at Brown University (Grant 1 K01 MH069113-01A1).

Figure 1 :
Figure 1: The directed acyclic graphical model representation of migration data

Table 1 :
Distribution of sexual partnerships and HIV infection

Table 2 :
Results for HIV infection among migratory partnerships form South Africa