Analysis of Correlation Structures using Generalized Estimating Equation Approach for Longitudinal Binary Data

Longitudinal binary data often arise in clinical trials when repeated measurements, positive or negative to certain tests, are made on the same subject over time. To account for the serial correlation within subjects, we propose a marginal logistic model which is implemented using the Generalized Estimating Equation (GEE) approach with working correlation matrices adopting some widely used forms. The aim of this paper is to seek some robust working correlation matrices that give consistently good fit to the data. Model-fit is assessed using the modified expected utility of Walker & GutiérrezPeña (1999). To evaluate the effect of the length of time series and the strength of serial correlation on the robustness of various working correlation matrices, the models are demonstrated using three data sets containing respectively all short time series, all long time series and time series of varying length. We identify factors that affect the choice of robust working correlation matrices and give suggestions under different situations.


Introduction
In longitudinal studies, observations measured repeatedly from the same subject over time are serially correlated.Time series with rather pronounced serial correlation are common in clinical trials and stock markets.When observations are measured on a continuous scale, the dependency structure between observations can be modelled in a covariance matrix with nonconstant variances and non-zero covariances.There are different types of covariance matrices to model such dependency between observations from the same subject.
On the other hand, when the outcomes are binary such as the positive or negative results to certain tests, we usually consider a generalized linear model (GLM) within the exponential family of sampling distributions with different link functions.In this paper, we propose a marginal logistic model and the parameters are estimated using the first order Generalised Estimating Equation (GEE) approach (Liang & Zeger 1986, Zeger et al. 1988and Hardin & Hilbe 2003).Marginal model is easy to interpret and forecast.While the dependency structure can be explicitly modelled by a working correlation matrix adopting some popular and general correlation structures, the dependency structure of a conditional model is often limited to an auto-regressive type only.Moreover marginal model allows the study of population-averaged effects of covariates on the response variable.Although random effects models allow within subject correlation, it fails to account for the serial correlation over time.
Over the past two decades, GEE approach has been developed and advanced.For example, Zhao and Prentice (1990) and Lipsitz et al. (1991) apply the GEE approach to analyse binary data, Kenward et al. (1994) consider ordinal data, Lipsitz et al. (1994) use categorical data, Prentice and Zhao (1991) focus on multivariate data, Park (1993) compares GEE approach to maximum likelihood approach and Miller et al. (1993) to weighted least squares approach, etc.Despite the emergence of new techniques in recent years, for example, the Bayesian and semi-parametric approaches, GEE approach is still popular and often offers an easy alternative.See Horton and Lipsitz (1999) for its implementation in various softwares and the new geepack module in R. Recent literature on GEE approach includes Ballinger (2004) for counts and continuous data and Copas and Seaman (2010) on the asymptotic bias using GEE under the missing at random (MAR) dropout environment.Although it has been known that parameter estimates using the GEE approach are robust to misspecification of the true correlation matrix, our experience with real data shows disagreement.Instead, we found that parameter estimates change considerably across models adopting different correlation matrices.In this regard, sensitivity analyses through simulation experiments are often performed to evaluate the per-formances of different models.However, it is difficult to simulate binary data with a specific working correlation matrix when the actual correlation structure is often unknown in practice.Hence we propose, in this paper, to fit models adopting different working correlation matrices and assess the model-fit using the modified expected utility (Walker & Gutiérrez-Peña, 1999).Working correlation matrix of the best model reveals valuable information regarding the actual dependency structure between observations.Since the length of the time series may affect the choice of robust correlation matrices, we investigate three data sets containing all short time series (N =3), all long time series (N =14) and time series of varying length (N ranges from 4 to 26).
In section 2, we describe the marginal logistic model and introduce some common working correlation matrices, namely, the completely independent (CI), the equal correlation (EC), the first order autoregressive (AR1), and the Toeplitz with 2 (TOEP2) and 3 (TOEP3) bands.We interpret these matrices in terms of the relationship between observations.Estimation procedures using the first order GEE approach are described in section 3.In section 4, three data sets are analyzed: the blood glucose data consisting of short time series (N =3), the plasma citrate concentration data consisting of long time series (N =14) and the methadone clinic data consisting of time series of varying length.The first two data sets are obtained from dichotomizing some continuous variables so that the transformed data can be fitted to a logistic regression model.Transformation of data is necessary here to study the robustness of different correlation structures because of the following reasons.Firstly, it has been increasingly popular to transform data to fulfill certain model assumptions.Secondly, simulation studies to evaluate the performance of marginal models adopting different correlation matrices is impossible because it is infeasible to simulate binary data with different working correlation structures.Furthermore, we resist the idea of simulation because arbitrary setting true model parameter values in a simulation experiment fails to reflect the complicated correlation structure in real data.In section 5, results from model fitting using GEE approach are analyzed and compared.Finally, in section 6, a conclusion is drawn on the best correlation structures that give consistently the best fit regardless of the length of time series and strength of autocorrelation.

Marginal logistic model
Longitudinal binary data are common and logistic regression is often used to model such data.Let Y mn denote the outcome of the n-th measurement from subject m where m = 1, . . ., M , n = 1, . . ., N m , M is the number of subjects and N m is the number of observations for subject m.There are two approaches to handle the serial correlation in Y mn .The first approach uses a conditional AR1 model where the previous observation Y m,n−1 is entered as a covariate in the mean model of Y mn .The conditional probabilities P c,mn = Pr(Y mn = 1|Y m,n−1 ) are modelled as logit-linear in some covariates, that is, However such approach limits the dependency structure to be an autoregressive type and the model is incapable of predicting future events for a given set of covariates.The second approach adopts a marginal model and uses a GEE approach with a working correlation matrix Σ 0 to describe the dependency structure between observations.
Here the marginal probabilities P mn = Pr(Y mn = 1) are modelled as logit-linear in some covariates, that is, where η mn is a linear function of covariates such that P mn = e ηmn 1 + e ηmn .The association across time of the repeated outcomes for a subject is treated as nuisance and is entered only in a working correlation matrix that appears in the estimating equations.Although it is known that parameter estimates are robust to misspecification of the true correlation matrix, our experience with data fitting shows disagreement.Discrepancy between certain significant parameter estimates can be as large as 30% among models adopting different working correlation matrices.See the parameter estimates with '*' in Table 2. Moreover as the actual correlation structure is often unknown in practice, we propose to adopt different working correlation matrices to model the dependency structure between observations explicitly and compare between results.The working correlation matrix of the best fitted model reveals valuable information regarding the dependency between observations.For N = 4, the working correlation can be modelled by the following J = 5 types:

Structure Properties
Completely independent (CI) First-order autoregressive (AR1) where p ρ is the number of correlation parameters which is fixed except TOEP (p ρ = N − 1) and the error For a larger N , the J = 5 types of correlation matrices can be similarly defined.Some of these correlation structures exhibit special properties.For example, EC corresponds to a random intercept model and AR1 an exponential correlation model.For equally spaced measurement times such that t n+1 − t n is a constant for all n, σ nn = σ 2 ρ |n−n | implies that the covariance between a pair of measurements on the same subject decays to zero as the time separation between the measurements increases.AR1, TOEP2 and TOEP3 assign equal correlation to observations of equal time lag.Practically, correlation between observations with large time lag is very low.Hence TOEP is usually set to contain just a few bands and this will not eliminate some highly possible correlation structures.For more details about the interpretation of some of these correlation structures, see Diggle et al. (1996).

Estimation using GEE approach
Prentice (1988) proposed the GEE approach using a Taylor series expansion to approximate the likelihood function.The estimating equation is where where Σ 0 is the working correlation matrix and the variance estimates S 2 mn = P mn (1 − P mn ) = e ηmn /(1 + e ηmn ) 2 .We estimate different correlation coefficients by for EC, (3) 1 2 for TOEP3.
(5) since ρ c , ρ 1 , ρ 2 are correlations between any pairs, only lag-1 pairs and only lag-2 pairs of observations respectively and they are estimated by correlations based on corresponding pairs in the sample.Note that ρ is correlation between any pairs allowing for the number of lag and it is estimated by sample correlation between successive (lag-1) pairs only.With estimates ρ 2 in iteration l, we can update β (l) to β (l+1) by solving (1) using the Newton-Raphson method.Then ρ l+1) in (3) to (5) and the cycle repeats again until convergence is reached.Finally, the covariance matrix of the proposed estimator β (Prentice, 1988) equals where Cov(Y m ) is given by (Y m − P m )(Y m − P m ) T .

Empirical studies 4.1 Blood glucose data
Data of inter and intra individual variation of blood glucose levels (Andrews and Herzberg 1985, P.211) obtained from registrants for pre-natal care at Boston City Hospital, USA, are used to estimate the variation of blood sugars on pregnant and non-pregnant women.There are 53 nonpregnant women, each receiving annual glucose tolerance test over a period of six years of which six fasting blood glucose tests were conducted and their one hour post blood glucose concentrations were measured.There are also 52 pregnant women, each having three fasting blood glucose tests and their one hour post blood glucose concentrations measurements.Measurements are in mg/100 ml.The variables in the data set are Dependent variable: Y : the indicator of whether the fasting glucose level X 1 is less than the 1-hour post blood glucose level Y 1 by more than 10 mg/100 ml, i.e.I(X 1 − Y 1 < 10) Independent variables: X 1 : the fasting blood glucose level, X 2 : the indicator of pregnancy.This data set is used in the analysis of short time series and it contains K = 315 (3 × (53 + 52)) observations coming from M = 105 subjects each repeatedly measured N = 3 times.For the 53 non-pregnant women, only the first three fasting and one hour post blood glucose levels are used in the analysis.The counts of M and K and the means of Y 1 , X 1 and Y across the non-pregnant and pregnant groups of women are presented in Table 1.The marginal probabilities P mn = Pr(Y mn = 1) are modelled as logit-linear in the covariates, that is, logit where (β 0 , β 1 , β 2 ) are the parameters for the intercept, X 1 and X 2 respectively.For model comparison, the conditional AR1 model ( CAR1) where the conditional probabilities P c,mn = Pr(Y mn = 1|Y m,n−1 ) and Y m,0 = 0, is also considered together with the marginal models with 5 different working correlation matrices.Table 2 shows that all parameters are significant and that a decrease in fasting blood glucose level and the state of pregnancy are significantly associated with a higher probability of increasing the one-hour post blood glucose level by more than 10 mg/100 ml.-0.2197 0.3768 -0.0108 0.00569 -0.3434 0.0736 0.4310 -0.74800 TOEP2 -0.2614 0.4017 -0.0107 0.00605 -0.3282 0.0745 0.4424 -0.72970 TOEP3 -0.1900 0.3636 -0.0104 0.00549 -0.3683 0.0752 0.4422 0.3533 -0.77870 Note: The larger parameter estimate β 2 with '*' is 30% more than the smaller parameter estimate for the plasma citrate concentration data.'-' refers to parameter not existing in the model.

Plasma Citrate Concentration data
An experiment involving 10 subjects (Andrews and Herzberg, 1985, P.237) was carried out to study the variation of plasma citrate concentration during a day.For each subject, the concentration of citrate in plasma (in µmol per litre) was measured hourly at 14 time points from 8am to 9pm during a day, with a total of K = 140 observations.Meals were given at 8am, at noon and at 5pm.The binary outcome of the logistic model is Y , an indicator of whether the plasma citrate concentration (Y 1 ) is more than 120 µmol per litre.Covariates include the time of the day (X 1 ) taking values from 1 to 14 and an indicator of meal time (X 2 ).A summary for the counts M and K and the means of Y 1 and Y across observations taken at non-meal and meal times are reported in Table 1.Model parameters are (β 0 , β 1 , β 2 ) for the intercept and the 2 covariates.
Again the CAR1 model with an autoregressive parameter β ρ for the previous outcome Y m n−1 and the marginal models with 5 different working correlation matrices are fitted to the data.However model with TOEP2 does not have stable parameter estimates.Table 2 shows that for all except the CAR1 models, only the indicator of meal time (X 2 ) is significantly associated with a higher probability of having the concentration of citrate in plasma in excess of 120 µmol per litre.

Methadone clinic data
The methadone clinic data set consists of records of drug users under methadone maintenance treatment program at a clinic in Western Sydney in 1986.Outcomes Y are the weekly urine test results which are positive or negative for morphine, a biological marker for heroin use.Methadone dose d in mg (X 1 ) at the time of urine test is included as a predictor variable, as is the log of treatment duration ln t in weeks (X 2 ) because previous research revealed that the methadone dosage and the duration of treatment are significant treatment factors.There were M = 136 heroin users, submitting a total of K = 2872 urine screens and each urine screen result serves as the unit of analysis.The average number of treatment weeks per heroin user is 21.1 with each user submitting 4 to 26 weekly outcomes (4 ≤ N m ≤ 26).51 of them dropped out prematurely and the rest having 26 outcomes are regarded as having completed the program.Table 1 displays the counts of M and K and the means of Y and X 1 across the non-dropout and dropout heroin user groups.For more detailed description of the data set, see Chan et al. (1998).
The marginal logistic regression model is: where d mn is the dosage administered to patient m at time n.Table 2 gives the results for the CAR1 model and 5 marginal models.In general, the dose effect is only marginally significant.The conclusion is qualitatively the same as other researches conducted on this data (Chan et al., 1998), namely, increase in methadone dose is associated with reduced heroin use and heroin use is also reduced over time.Moreover the significant autoregressive parameter β ρ in the CAR1 model indicates that a strong and positive association between the present Y it and previous Y i t−1 outcomes, suggesting that some patients in treatment tend to use heroin continuously while others do not.

Results
In order to compare the performance of marginal logistic models with different correlation matrices, the posterior expected utility where P mn = e ηmn 1+e ηmn and ηmn = βo + β1 X mn1 + β2 X mn2 for marginal models, βo + β1 X mn1 + β2 X mn2 + βρ Y m n−1 for CAR1 model, proposed by Walker and Gutiérrez-Peña (1999) is evaluated for each model.The U is the average of the natural logarithm of the probability densities of data.Obviously, a less negative U indicates a higher likelihood of the model given the data.Note that the nuisance parameters in the working correlation matrix of the marginal models do not enter explicitly into the calculation of P mn .However, as they appear in the estimating equations, they affect the parameter estimates ( β0 , β1 , β2 ).
For the blood glucose data, Table 2 shows that the estimates of ρ are very small for all marginal models and the autoregressive parameter β ρ in the CAR1 model is also insignificant.This shows that the strength of serial correlation is weak possibly because a short time series cannot adequately reveal the dependency structure for binary data.Hence the U values are similar across conditional and marginal models.Marginal model adopting a CI correlation matrix has the least negative U value showing that observations within patients are essentially independent.
For the plasma citrate concentration data, the ρ(s) are much larger for all marginal models and the autoregressive parameter β ρ in the CAR1 model is strongly significant.This shows that the strength of serial correlation is much stronger.Clearly the ρ c in the correlation matrix of EC type is generally smaller than the ρ in the AR1 matrix or the ρ 1 in the TOEP2 and TOEP3 matrices because EC assumes equal correlation independent of the time-lag between any two observations.Moreover, ρ 1 in the TOEP3 matrix which describes the correlation between observations of 1 time-lag is generally larger than ρ 2 which reveals the correlation between observations of 2 time-lag.Since the CAR1 model gives the least negative U value, it again confirms the strong serial correlation between observations within patients.However, across marginal models, the U value is least negative for model with CI correlation matrix but this value is very close to that of model with EC matrix.For this data, we believe that the EC matrix is the best correlation matrix to describe the dependency structure between observations from the same subject.
Lastly, for the methadone clinic data, Table 2 shows that the ρ(s) are moderate for all marginal models and the autoregressive parameter β ρ in the CAR1 model is also significant.The U values differ substantially across models possibly because of the generally longer time series and their specific dependency structures.The marginal model adopting an EC correlation matrix gives the least negative U value again confirming the strong serial correlation between observations within patients but such serial correlation does not decrease as the time-lag between observations increases.Again the EC matrix is the best correlation matrix for this data to describe the dependency structure between observations from the same patient.

Conclusion
Although the GEE approach has been proposed for more than twenty years, few studies investigate choices of working correlation matrices for data with different correlation structures, possibly because parameter estimates for marginal models using a GEE approach are known to be robust against misspecification of working correlation matrix.However our experience with real data analyses shows the contrary: parameter estimates (β 0 , β 1 , β 2 ) as well as the model fit measure U differ substantially across models adopting different working correlation matrices.Results from three real data analyses suggest that the EC matrix is an appropriate choice of working correlation matrix if the time series is short or the serial correlation does not decrease sharply with time.This is perhaps due to the limited information in binary data that favors simple correlation structures.If computationally feasible, fitting models with all the five working correlation matrices and choosing one with the least negative posterior expected utility U is also a good practice.

Table 1 :
Summary of the three data sets

Table 2 :
Parameter estimates with SE in italic for the three data sets