A GEE Approach for Estimating Correlation Coefficients Involving Left-censored Variables

HIV (Human Immunodeficiency Virus) researchers are often concerned with the correlation between HIV viral load measurements and CD4+ lymphocyte counts. Due to the lower limits of detection (LOD) of the available assays, HIV viral load measurements are subject to left-censoring. Motivated by these considerations, the maximum likelihood (ML) method under normality assumptions was recently proposed for estimating the correlation between two continuous variables that are subject to left-censoring. In this paper, we propose a generalized estimating equations (GEE) approach as an alternative to estimate such a correlation coefficient. We investigate the robustness to the normality assumption of the ML and the GEE approaches via simulations. An actual HIV data example is used for illustration.


Introduction
Viral load assessment via quantification of plasma viral RNA (Ribonucleic Acid) plays an important role in current HIV (Human Immunodeficiency Virus) research.It has provided valuable insights into the pathogenesis of HIV disease and the activity of anti-viral drugs.However, inherent limits of detection (LOD) in existing HIV RNA assays lead to the possibility of left-censored (also termed missing) data.Such left-censored data is also characteristic of many other types of bioassay studies (Lynn, 2001).As both CD4+ cell counts and plasma HIV RNA are bio-markers of the progression of HIV disease, the study of the correlation between the CD4 + cell count and the HIV viral load is often among HIV researchers' interests.Therefore, there is a need to estimate the correlation between two variables, where one of them may be left-censored.
An ad hoc but convenient approach is to replace the censored values by, e.g., 1, 1/2 or 1/ √ 2 times the LOD, depending on the assumed shape of the left tail of the distribution (Hornung and Reed, 1990;Olson, 1993).This method can cause bias and even misinterpretation, especially when the censoring rate is high (Lyles et al., 2001).Two available parametric methods for the case of one left-censored variable are multiple imputation (MI) and maximum likelihood estimation (ML) assuming an underlying bivariate normal distribution.Comparing the two approaches, Lynn (2001) favored the ML approach in terms of bias.Lyles (2001) found the two methods comparable for point estimates and argued for an MI paradigm on the basis of confidence interval performance.However, the ML approach makes a full normality assumption and the performance of the ML method with non-normal data has not been studied.
In this paper, we propose an alternative approach using generalized estimating equations (GEE) to estimate correlations involving left-censored data, and we compare it with the ML approach via simulated normal and non-normal data.The GEE approach has been widely applied in many statistical applications due to its less stringent distributional assumptions and robustness properties (Liang and Zeger, 1986;Zeger and Liang, 1986).It has also been used to estimate correlation coefficients when data are not censored (Qu et al, 1992;Catalano and Ryan, 1992;Barnhart and Williamson, 2001).
In section 2, we present GEE models for estimating correlation coefficients when one variable is subject to left-censoring.In section 3, we report several simulation studies aimed at comparing the GEE and the ML approaches for both normal and non-normal data.In section 4, we apply the proposed method to a real data set stemming from clinical trials conducted in Bangkok, Thailand.In this example, we investigate the correlation between cervicovaginal HIV viral load measurements and CD4 + lymphocyte counts from HIV positive women.We also include an example in this section to illustrate the potential advantage of the proposed GEE approach when modeling covariates that may impact the correlation coefficient.A discussion is presented in section 5.

Method
Let X and Y be continuous random variables.Let X L be a left-censored variable corresponding to X with LOD (cut point) L x .Specifically, we define X L as follows: where x 0 can be any fixed constant that is less than or equals to L x (see section 2.3 regarding the choice of x 0 ).We assume that (X, Y ) has a mean of (µ x , µ y ) and a covariance matrix of Our main interest is to estimate the correlation ρ between X and Y, where X is subject to left-censoring.For simplicity, we do not introduce covariates initially; an extension to incorporate covariates will be presented later.Although we make use of the bivariate normal distribution when computing expectations in the development of the methodology specified below, the full bivariate normality assumption is not theoretically necessary.Rather, the GEE approach requires only that the moment specifications are approximately correct.
We present two sets of generalized estimating equations for estimating the correlation coefficient ρ, given paired observations of the left-censored variable (X L ) and the complete variable (Y ).Let {x Li , y i }, i = 1, • • • , N, be random realizations of X L and Y.In order to estimate the correlation coefficient, one first needs to estimate the parameter θ = (µ x , µ y , σ x , σ y ) .Letting we propose the first GEE equation to estimate θ by modeling the marginal mean of Z i with E(Z i ) = U(θ) as follows: where U(θ) = (U[1], µ y , U[3], σ 2 y + µ 2 y ).Computing moments as if X and Y are distributed as bivariate normal, then, , where τ x = (L x − µ x )/σ x , Φ(x) denotes the standard univariate normal cumulative density function and φ(x) refers to the standard normal density function.D i = ∂U/∂θ is a 4×4 matrix (see Appendix) and V i is the working covariance matrix.The GEE approach uses empirical covariance estimates to adjust for a mis-specified covariance structure without loss of much efficiency (Liang and Zeger, 1986;Zeger and Liang, 1986).For convenience, we take V i as a diagonal matrix with diagonal entries as the variances of X L , Y, X 2 and Y 2 that apply under normality.Thus, we have , where diag(A) denotes the diagonal matrix with vector A as the diagonal elements.
We propose to estimate ρ by modeling the mean of the conditional distribution of X Li |Y i .Note that under normality, X|Y is distributed as where We solve for ρ using the following second estimating equation: where and Here, W i is the working variance of X Li |Y i .In practice, we use x , as obtained under normality.To obtain the point estimate of ρ, a modified Fisher-scoring iterative procedure is used.Specifically, we obtain the estimate of θ, by the iteration process, By replacing θ with θ in equation (2.2), the estimate of ρ is obtained by the iteration process, Following prior arguments regarding generalized estimating equations (Liang, Zeger and Qaqish, 1992), we can readily show that the parameter estimates are consistent provided that U(θ) and γ i (ρ, θ) are correctly specified.This is true regardless of whether the working covariance matrices in the two sets of equations are correctly specified.To obtain the standard error of ρ, we follow the similar procedures as in Prentice (1988) and Barnhart and Williamson (2001).

Evaluating impact of covariates on the correlation
A particular advantage of the proposed GEE approach is that it can easily be extended to investigate covariates' impact on the correlation.Let Q 1 denote the design matrix formed by covariates.First, we model θ as θ = Q 1 β, where the parameter estimates of β can be obtained by modifying equation (2.1) as where D i = ∂U i /∂β.Second, we use Fisher's Z-transformation to model the correlation coefficient as We use Fisher's Z-transformation here because it ranges from −∞ to ∞ and its quantity is more stable than ρ.Q 2 is a design matrix of covariates impacting ρ that may or may not be the same as Q 1 .The estimates of α can be solved by modifying equation (2.2) as

Remark on the choice of x 0
The censored observations of X are assigned to a fixed value, x 0 (less than or equal to the LOD).In general, the consistency of the proposed GEE estimator holds regardless of the choice of x 0 .We have performed simulations to evaluate different choices of x 0 (not shown) and found minimal bias in the GEE estimate in each case.However, defining x 0 based on the appealing condition that E(X) = E(X L ) tends to provide improved efficiency and 95% coverage.This implies that (2.5) Because µ x and σ x are unknown, we employ the following procedure in solving equation (2.1).First, x 0 = L x /2 is used as the initial value, then x 0 is updated with the updated θ.Alternatively, one can insert ML estimates (e.g., Cohen, 1959) of µ x and σ x assuming a univariate left-censored normal model to get reasonable estimate for x 0 .

Simulations
We summarize two simulation studies in this section.The first is designed to assess the performance of the GEE estimates for ρ, and to compare them with the MLEs based on normal data.For ease of comparison, simulations were performed using the same true parameter settings as in Lyles et al. (2001).Without loss of generality, we set θ = (0, 0, 1, 1) .The second simulation study is conducted to assess the robustness to the normality assumption using both the proposed GEE approach and the ML approach of Lyles et al. (2001).
We first generated continuous responses {x i , y i }, i = 1, • • • , N, from the standard bivariate normal distribution.Then, {x Li } was generated from {x i } according to L x .The true value of L x is determined by the censoring rate and the true parameter settings.Simulations were conducted for several combinations of censoring rate (20%, 60%), the true value of ρ (0.25, 0.5, or 0.75) and sample size N (50 or 100), with a total of 12 simulation settings.A total of 1000 simulated data sets were generated for each combination.Table 1 summarizes these simulation results.We report the means of ρ, the means of the empirically corrected standard error of ρ, the standard deviation of the 1000 ρ's and the 95% coverage based on the estimated standard error of ρ.Corresponding results using ML (Lyles et al., 2001a) are also included for comparison.We observe that the GEE approach performed similarly to ML, although there is a tendency toward underestimation of the standard error of ρ for sample size N = 50.As suggested by Barnhart and Williamson (2001), an adjustment to the estimated standard error, such as multiplying by N/(N − 1) or (N/(N − 2), may be useful when N is less than or equal to 50.
To explore the robustness to non-normality of the GEE and ML approaches, we considered two kinds of non-normally distributed data: correlated uniform data in the interval (−1, 1), and correlated chi-square data (with degrees of freedom 10, moderately skewed).These correlated non-normal data were generated by a method similar to that of Saucier (2000).Although data were generated from non-normal distributions, the mean specifications in the two sets of estimating equations and the full likelihood for the ML method were based on the normality assumption and thus misspecified, motivating the robustness study.
Simulations were performed for each combination of the censoring rate (20%, 60%) and true ρ (0.25 or 0.5).We used a sample size of 100 and a total of 1000 simulated data sets in each case.As shown in Table 2, both the GEE and ML approaches continue to perform reasonably well for correlated uniform data and correlated chi-square data in the context of point estimates and 95% coverage.The point estimate for ρ from the GEE approach shows minimal bias for correlated uniform data with a low or moderate censoring rate, and a slight tendency toward underestimation for moderately skewed correlated chi-square data.Compared with the GEE estimates, the MLEs of ρ tend to be slightly larger in the latter case.For correlated uniform data, the mean standard error based on the GEE method is closer to the empirical standard deviation than the one based on the ML method, where the empirical standard deviation is calculated from 1000 estimates obtained from the 1000 simulated data sets.The disparity between the two based on ML translates into the overly conservative 95% coverage observed in Table 2.
In summary, the proposed GEE method performs very consistently with ML for point estimation under bivariate normality.For heavy-tailed and moderately skewed non-normal data, both the GEE and ML methods remain relatively robust, with the GEE method marginally better than ML with respect to closeness of the estimated standard errors to the empirical standard deviations based on simulated data sets .

Examples
Ever since the emergence of HIV viral load as a new virologic marker for describing HIV/AIDS progression, researchers have been interested in the correlation of the HIV viral load measurements with the immunologic marker, the CD4+ cell count, which was traditionally the standard benchmark for prognosis of HIV patients (Mulder, et al., 1994;Mellors et al., 1995).Because the assay for HIV viral load has a lower limit of detection, it is important to account for this in estimating the correlation between the HIV viral load and CD4+ cell counts.We apply the proposed methods to estimate this correlation from a HIV study from a Centers for Disease Control and Prevention (CDC)-sponsored clinical trial, conducted in Bangkok, Thailand.Data were collected from 154 women in a Zidovudine-treated subgroup of the study.Available data are CD4+ cell counts (Y) at 36 weeks' gestation and HIV viral load from cervicovaginal lavage (CVL, X L ) at 38 weeks' gestation.Details of the study and the data were reported elsewhere (Shaffer et. al., 1999).A total of 120 out of 154 women had non-detectable CVL measurements, implying a 77.9% left censoring rate.As suggested by Lyles et al. (2001), a log transformation was applied to the CVL measurements.As the lower limit of detection of the assay is 400 copies/ML, we have L x = log 10 (400) = 2.60206 in this example.Results of the analysis are presented in Table 3.We also include results based on 1000 simulated data sets, where the true parameters are set to the GEE estimates θ = (1.572, 444.63, 1.335, 201.78) and the sample size is the same as in the data set.The corresponding results from ML and the ad hoc method that replaces censored values by LOD/2 are also included for comparison.Both the GEE and ML methods give similar results, while the ad hoc approach is strongly biased.These results suggest that the correlation between CVL measurements and CD4+ counts for HIV-infected women in the zidovudine group is negative, with an estimated magnitude of 0.24.To illustrate the GEE approach for data with covariates, we use additional data from the same study as in the first example by including an additional 155 women who received a placebo instead of Zidovudine.A total of 74 women in the placebo group had non-detectable measurements of HIV RNA in CVL, implying a 47.7% censoring rate, as compared to the 77.9% censoring rate among women in Zidovudine group.We seek to determine whether the correlation between CD4+ counts and HIV RNA measurements in CVL for women in the Zidovudine group is significantly different from that in the placebo group.We model θ by where x is the indicator variable for the Zidovudine group.The estimate of α = (α 0 , α 1 ) is (−0.323,0.090) with standard error of (0.189, 0.274).Thus the estimated correlations are −0.23 and −0.31 for the Zidovudine and placebo groups, respectively.This GEE result immediately implies a p-value of 0.37 for testing α 1 = 0, indicating that the correlation between CD4+ cell counts and HIV RNA measurements in CVL for HIV-infected women in the Zidovudine group is not significantly different from that in the placebo group.

Discussion
We have proposed a GEE approach for estimating the correlation between two continuous variables, where one variable is subject to left censoring.We evaluated the method by simulation studies and illustrated it with a clinical example.We also investigated the robustness to the normality assumption for both the ML approach and the GEE approach via simulations.As an alternative to ML, the GEE estimates for ρ performed comparably in both simulation studies and on real data sets.In addition to providing a unified framework for estimation of and inference about ρ, GEE may be easier to use than ML with respect to modeling covariates' impact on the correlation.Although we make use of the bivariate normality assumption in computing expectations, the full normal distribution is not required as long as the means are correctly specified in the two GEE sets.Based on the simulation study for non-normal data with misspecified means, we found that the GEE method performed similarly to or marginally better than the ML method for uniform and moderately skewed data.
Although the ML approach may be used as a standard method for estimating the correlation between left-censored variables, we tend to favor use of the GEE approach because of its theoretical advantage, because it performs similarly to ML without requiring optimization routines, and because of its convenience for modeling covariates' impact on correlation.When one or both variables under study are very strongly skewed, we recommend use of a log or other transformation before applying either approach (as seen in the real data examples).Additional simulation studies (not detailed here) for highly right-skewed bivariate log-normal data indicated potentially poor confidence interval coverage without a transformation.
We have employed a two-stage estimating equations approach.Specifically, we estimate θ in the first set of equations and then plug into the second set of equations to estimate ρ.Another alternative (not explored here) would be to use a three-stage approach, i.e, to estimate (µ y , σ y ) in the first equation, estimate (µ x , σ x ) in the second equation, and then estimate ρ in the third equation.We believe that the 2-stage approach is more efficient than the 3-stage approach in estimating θ (Liang, et al., 1992).
An alternative way to estimate ρ is to model the mean of the product (E(X L Y )), instead of the conditional mean (E(X L |Y )).However, E(X L Y ) involves double integration and is thus computationally intensive.Furthermore, according to Carey et al. (1993), modeling the conditional mean has reasonable efficiency as compared to modeling the mean of the product.Therefore, we prefer our approach to estimate ρ by modeling the conditional mean.
The computer programs used in this research were written in Splus (Windows 2000) and R (Version 1.5.1, 2002).All programs are available from the first author upon request.

Table 1 :
Simulation Study for the Case of One Left-censored Variable

Table 2 :
Simulation Study for Investigating Robustness of Normality Assumption (N = 100)

Table 3 :
Correlation between CD4+ cell counts and log-transformed HIV RNA in CVL