Skew-normal Linear Mixed Models

Normality (symmetry) of the random effects and the within-subject errors is a routine assumption for the linear mixed model, but it may be unrealistic, obscuring important features of among- and within-subject variation. We relax this assumption by assuming that the random effects and model errors follow skew-normal distributions, which include normality as a special case and provide flexibility in capturing a broad range of non-normal behavior. The marginal distribution of the observed quantities is derived in closed form, so inference may be carried out using existing statistical software and standard optimization techniques. We also implement an EM-type algorithm, which appears to provide some advantages over direct maximization of the likelihood. Results of simulation studies and applications to real data sets are reported.


Introduction
The linear mixed model (LMM) has become one of the most commonly used tools for analyzing continuous repeated measures data from a sample of individuals in agricultural, environmental, biomedical, economic, and social science applications. Let Y j be an (n j × 1) vector of observed continuous responses for sample unit j, j = 1, . . ., m. We assume that Y j follows the general LMM

Y j = X j β + Z j b j + ε j , j = 1, . . ., m, (1.1)

where X j , of dimension (n j × p), is the design matrix corresponding to the fixed effects; β, of dimension (p × 1), is a vector of population-averaged regression coefficients called fixed effects; Z j , of dimension (n j × q), is the design matrix corresponding to the (q × 1) random effects vector b j ; and ε j , of dimension (n j × 1), is the vector of random errors. A standard but possibly restrictive assumption is that the random effects b j and the residual components ε j are independent, with

b j iid ∼ N q (0, D), ε j ind ∼ N n j (0, ψ j ), (1.2)

where D = D(α) and ψ j = ψ j (γ), j = 1, . . ., m, are dispersion matrices, usually associated with among- and within-individual variability, which depend on unknown, reduced-dimensional parameters α and γ, respectively. The most frequently used estimation methods for the parameters in model (1.1)-(1.2) are maximum likelihood and restricted maximum likelihood (Lindstrom and Bates, 1988). Confidence intervals and hypothesis tests for the parameters are generally based on asymptotic results, and software packages such as SAS proc mixed (Littell, Milliken, Stroup and Wolfinger, 1996) or S-plus lme (Pinheiro and Bates, 2000) incorporate procedures for analyzing the LMM under these assumptions.
Though model (1.1)-(1.2) offers great flexibility for modelling these effects, it suffers from the same lack of robustness against departures from distributional assumptions as other statistical models based on the Gaussian distribution, and it may be too restrictive to provide an accurate representation of the structure present in repeated measures and clustered data. From a practical point of view, the most commonly adopted approach to achieve multivariate normality involves transforming the variables. Although such methods may give reasonable empirical results, they should be avoided if a more suitable theoretical model can be found (Azzalini and Capitanio, 1999). Thus, considerable interest has focused on relaxing the normality assumption and jointly estimating the random effects and model parameters. In this context, recent proposals have been based on replacing the assumption of normality with the weaker assumption that the random effects have a "smooth" density that may be skewed. For example, Zhang and Davidian (2001) use the semiparametric (SP) density representation proposed by Gallant and Nychka (1987) to characterize the random effects density; an appealing feature of this approach is that the likelihood for all model parameters may be expressed in closed form. Alternatively, Verbeke and Lesaffre (1996) adopt a mixture-of-normals representation and carry out inference via an EM algorithm (see Verbeke and Molenberghs, 2000, chap. 12). Magder and Zeger (1996) use a form of nonparametric maximum likelihood based on normal densities, with somewhat ad hoc fitting and assessment of fit. Tao, Palta, Yandell, and Newton (1999) estimate the density of a scalar random effect via a predictive recursive algorithm. Recently, Sahu et al. (2003) proposed an alternative model suitable for Bayesian implementation, which seems not to be adequate for maximum likelihood implementation. We propose an alternative method that is particularly attractive for linear mixed models: we assume that both the random effects and the model errors follow a skew-normal distribution. Our approach may offer the advantages of more efficient estimators and algorithms (in special cases) and of a practical interpretation of the model parameters. It also has the advantage of providing readily available information matrices.
The plan of the paper is as follows. In Section 2, for the sake of completeness, we consider a multivariate extension of the univariate skew-normal distribution proposed by Azzalini (1985). Properties such as moments and a stochastic representation of this multivariate distribution are also discussed. In Section 3 the skew-normal linear mixed model (SNLMM, hereafter) is defined, extending the usual normal mixed model. The marginal density of Y j is obtained analytically by integrating out the random effects b j , j = 1, . . ., m, leading to the observed (marginal) likelihood function, which can be maximized directly using existing statistical software such as Ox, R or Matlab. We point out that the analytical expression for the likelihood function provided in this paper is not available elsewhere in the literature. Section 4 presents an EM-type algorithm that has advantages over the direct maximization approach, especially in terms of robustness with respect to starting values and a closed-form expression for estimating β in the iterative process. Section 5 reports results of a simulation study and Section 6 reports an application to a real data set, indicating the usefulness of the approach.

A Skew-normal Distribution
In this section we introduce the multivariate skew-normal distribution that will be used in defining the SNLMM considered in the following section. We start by fixing notation that will be used throughout the paper and presenting a brief review of the literature on univariate skew-normal models.
Let φ n (x|µ, Σ) and Φ n (x|µ, Σ) be the probability density function (pdf) and the cumulative distribution function (cdf), respectively, of the N n (µ, Σ) distribution evaluated at x. When µ = 0 and Σ = I n (the n × n identity matrix), we denote these functions by φ n (x) and Φ n (x).
As considered in Azzalini (1985), a random variable Y follows a univariate skew-normal distribution with location parameter µ, scale parameter σ 2 and skewness parameter λ if the pdf of Y is given by

f(y) = 2 φ 1 (y|µ, σ 2 ) Φ 1 (λ(y − µ)/σ), y ∈ R. (2.1)

Note that if λ = 0 then the density of Y in (2.1) reduces to the density of the normal distribution. We use the notation Y ∼ SN 1 (µ, σ 2 , λ) to denote this distribution, which reduces to Y ∼ SN 1 (λ) when µ = 0 and σ 2 = 1. Properties of this distribution can be found in Azzalini (1985) and Henze (1986).
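As a quick numerical companion to the density in (2.1), the following sketch evaluates the univariate skew-normal pdf with SciPy; the function name `sn_pdf` is ours, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def sn_pdf(y, mu=0.0, sigma2=1.0, lam=0.0):
    """Density of the univariate skew-normal SN1(mu, sigma2, lam):
    2 * phi(y | mu, sigma2) * Phi(lam * (y - mu) / sigma)."""
    sigma = np.sqrt(sigma2)
    z = (y - mu) / sigma
    # standard normal pdf rescaled to scale sigma, times the skewing factor
    return 2.0 * norm.pdf(z) / sigma * norm.cdf(lam * z)

# lam = 0 recovers the N(mu, sigma2) density, as noted in the text
y = np.linspace(-4.0, 4.0, 9)
check = np.allclose(sn_pdf(y), norm.pdf(y))
```

Setting `lam=0` reproduces the normal density exactly, mirroring the remark following (2.1).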
Studies of multivariate skew-normal distributions can be found in Azzalini and Dalla Valle (1996), Azzalini and Capitanio (1999), Branco and Dey (2001), and Sahu et al. (2003), among others. Arellano-Valle, del Pino and San Martin (2002) show that many of the properties of the multivariate skew-normal distribution hold for a general class of skewed distributions obtained from a symmetric class, defined in terms of independence conditions on signs and absolute values, and they give a general formula for obtaining skewed pdf's. Building on these results, Arellano-Valle and Genton (2005) introduce the class of fundamental skewed distributions, giving a unified approach for obtaining multivariate skew distributions starting from symmetric ones. In this work, we consider a special case of the fundamental skew-normal distribution proposed by Arellano-Valle and Genton (2005) (see also Azzalini and Dalla Valle, 1996, and Azzalini and Capitanio, 1999). The definition is given next.
Definition 1: An n-dimensional random vector Y follows a skew-normal distribution with location vector µ ∈ R n , dispersion matrix Σ (an n × n positive definite matrix) and skewness vector λ ∈ R n if its pdf is given by

f(y) = 2 φ n (y|µ, Σ) Φ 1 (λ T Σ −1/2 (y − µ)), y ∈ R n . (2.2)

We denote this by Y ∼ SN n (µ, Σ, λ), and by Y ∼ SN n (λ) when µ = 0 and Σ = I n , the n-dimensional identity matrix.
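A density of the form 2 φ n (y|µ, Σ) Φ 1 (λ T Σ −1/2 (y − µ)) can be evaluated directly; the sketch below assumes that skewing form (consistent with Definition 1 as reconstructed here) and uses the symmetric square root of Σ. The helper name `msn_pdf` is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.linalg import sqrtm

def msn_pdf(y, mu, Sigma, lam):
    """Density of SN_n(mu, Sigma, lam): twice the N_n(mu, Sigma) density
    times Phi_1 evaluated at lam' Sigma^{-1/2} (y - mu)."""
    Sigma_inv_sqrt = np.linalg.inv(np.real(sqrtm(Sigma)))  # symmetric square root
    skew_arg = lam @ Sigma_inv_sqrt @ (y - mu)
    return 2.0 * multivariate_normal.pdf(y, mean=mu, cov=Sigma) * norm.cdf(skew_arg)
```

Two sanity checks follow from the form of the density: `lam = 0` gives back the multivariate normal pdf, and the reflected pair f(y) + f(2µ − y) equals twice the symmetric normal density, since the skewing arguments cancel.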
Remark 1: Since the condition that Φ 1 (−w) = 1 − Φ 1 (w) for all w ∈ R is sufficient to guarantee that (2.2) is a pdf, we can use different reparameterizations to represent the asymmetry parameter λ, for example in terms of some δ ∈ R n and a positive definite n × n matrix ∆ such that δ T ∆ −1 δ < 1. Two special cases can be considered: ∆ = Σ, which is the reparameterization used by Azzalini and Dalla Valle (1996), and ∆ = I n , which is used in Arellano-Valle and Genton (2005). More generally, we can replace the asymmetric part of (2.2) (or skewing function; see Genton and Loperfido, 2001) Φ 1 (•) by an arbitrary function Q(•) on [0, 1] that depends on y through an even real function (or antisymmetric function; see Arellano-Valle and del Pino, 2003), say w(y), and satisfies Q(−w) = 1 − Q(w). Thus, the skew-normal distribution in (2.2) can be extended by considering (2.4). Many properties of the above skew-normal distribution may be derived from the results of Arellano-Valle and Genton (2005) (see also Arellano-Valle et al., 2002, and Arellano-Valle and del Pino, 2003). From these results there follows, for example, the stochastic representation given next for a standardized skew-normal random vector.
Proposition 1: If Y ∼ SN n (λ), then Y admits the stochastic representation

Y =d δ|X 0 | + (I n − δδ T ) 1/2 X 1 , with δ = λ/(1 + λ T λ) 1/2 , (2.5)

where X 0 ∼ N 1 (0, 1) and X 1 ∼ N n (0, I n ) are independent and =d denotes equality in distribution.

In the appendix we provide a proof of this proposition. Notice that the stochastic representation given in Henze (1986) for the univariate case is a special case of (2.5). Thus, we have extended the univariate skew-normal distribution given in (2.1) to the multivariate case in a natural way. In Figure 1 we present some contours of the density of the bivariate skew-normal distribution SN 2 (0, Σ, λ) for different values of Σ and λ. Note that these contours are not elliptical and can be strongly asymmetric for suitable choices of the parameters. A direct consequence of Proposition 1 is given in the following corollary.
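The stochastic representation in (2.5) gives an immediate way to draw from the standardized SN n (λ) distribution: scale a half-normal draw by δ and add independent Gaussian noise in the orthogonal complement. A minimal sketch, assuming δ = λ/(1 + λ T λ) 1/2 as above (the function name `rmsn_std` is ours):

```python
import numpy as np

def rmsn_std(n_samples, lam, rng=None):
    """Draw from the standardized SN_n(lam) via the representation
    Y = delta |X0| + (I - delta delta')^{1/2} X1."""
    rng = np.random.default_rng(rng)
    lam = np.asarray(lam, dtype=float)
    n = lam.size
    delta = lam / np.sqrt(1.0 + lam @ lam)
    # symmetric square root of I - delta delta' via eigendecomposition
    M = np.eye(n) - np.outer(delta, delta)
    w, V = np.linalg.eigh(M)
    M_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    X0 = np.abs(rng.standard_normal(n_samples))      # half-normal factor
    X1 = rng.standard_normal((n_samples, n))         # symmetric Gaussian part
    return X0[:, None] * delta + X1 @ M_half
```

Since the half-normal factor has mean (2/π) 1/2 , the sample mean of the draws should approach (2/π) 1/2 δ, which is a convenient check on the implementation.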

The SNLMM Likelihood Function
In order to allow both symmetric and asymmetric behavior in characterizing features of real data sets, the SNLMM is defined by extending the normal mixed model in (1.1)-(1.2), assuming that

b j iid ∼ SN q (0, D, λ b ) and ε j ind ∼ SN n j (0, ψ j , λ ej ), j = 1, . . ., m,

with b j independent of ε j , which, by using Corollary 1, leads to a hierarchical model for Y j , j = 1, . . ., m, that must be considered in order to obtain a correct interpretation of the model parameters. In practice, however, we can in general recover the usual interpretation by correcting the intercept parameter in the fitted model, as will be done in Section 4. The main interest is inference on the parameter vectors θ = (β T , α T , γ T ) T and λ = (λ T b , λ T e1 , . . ., λ T em ) T . As discussed in Verbeke and Molenberghs (2000), unless the data are analyzed in a Bayesian framework, inference in this type of model has to be based on the marginal distribution of the response Y j . The marginal density of Y j is obtained in Theorem 1, whose proof is given in the appendix, with the quantities involved defined in (2.6) and (2.7). Note that the likelihood (2.5) is not in the class defined here, since the skewness factor in this expression is of dimension 2. It is, however, in the class of fundamental skew-normal distributions considered by Arellano-Valle and Genton (2005) (see also (2.4) in Remark 1). The result presented in Theorem 1 is important because it avoids the use of more complex numerical techniques, such as Monte Carlo integration, to carry out inference in this type of model: it gives a closed form for the marginal distribution of Y j , j = 1, . . ., m, allowing straightforward implementation of inference with standard optimization routines. Thus, denoting the log-likelihood function by ℓ(θ, λ), it can be written as the sum over subjects of the logarithms of these marginal densities, with µ 1j , µ 2j , Σ j , Γ j and Λ j as defined in (2.6) and (2.7).
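To make the closed-form marginal likelihood concrete, the sketch below evaluates one subject's marginal log-density in the special case λ e1 = . . . = λ em = 0, using the standard convolution form of a skew-normal response: Y j = X j β + Z j ∆t j + Z j b* j + e j with t j half-normal, ∆ = D 1/2 δ and δ T δ < 1. All function and argument names are ours, and the parameterization (Cholesky factor as D 1/2 ) is one illustrative choice, not the paper's exact expression.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def marginal_loglik_j(y, X, Z, beta, D, delta, sigma2_e):
    """Marginal log-density of one subject's response when only the
    random effects are skewed: f(y) = 2 phi_n(y | mu, Sigma) *
    Phi_1(d' Sigma^{-1} (y - mu) / sqrt(1 - d' Sigma^{-1} d))."""
    n = y.size
    D_half = np.linalg.cholesky(D)        # one choice of D^{1/2}
    Delta = D_half @ delta                # requires delta' delta < 1
    d = Z @ Delta                         # skewness direction in response space
    mu = X @ beta
    Sigma = sigma2_e * np.eye(n) + Z @ D @ Z.T   # Psi_j + d d'
    Sinv = np.linalg.inv(Sigma)
    r = y - mu
    tau2 = 1.0 - d @ Sinv @ d
    skew_arg = (d @ Sinv @ r) / np.sqrt(tau2)
    return (np.log(2.0)
            + multivariate_normal.logpdf(y, mean=mu, cov=Sigma)
            + norm.logcdf(skew_arg))
```

Summing this over j gives a log-likelihood that can be handed to a generic optimizer; the closed form replaces the one-dimensional integral over the half-normal factor.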
We call attention to the fact that no explicit solution is available for this maximization problem, so the likelihood function has to be maximized numerically. Some special cases may be of interest, for instance λ e1 = . . . = λ em = 0 or λ b = 0, which are special cases of the general situation above. These situations are treated next.
Corollary 2: Under the conditions of Theorem 1, the marginal densities simplify as in (2.9) and (2.10). Although simpler, the log-likelihood functions that follow from (2.9) and (2.10) must also be maximized numerically. The asymptotic covariance matrix of the maximum likelihood estimators (MLE) can be estimated using the Hessian matrix, which can also be computed numerically, for example in R. In the next section we present an EM-type algorithm for computing the MLE under the densities obtained in Corollary 2.
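The numerical-Hessian route to standard errors mentioned above can be sketched in a few lines: approximate the observed information by central finite differences of the negative log-likelihood at the MLE and invert it. The helper below is generic; `h` is a step size one may need to tune.

```python
import numpy as np

def num_hessian(f, theta, h=1e-5):
    """Central finite-difference Hessian of a scalar function f at theta."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = h
            ej = np.zeros(p); ej[j] = h
            # standard 4-point central difference for the (i, j) entry
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4.0 * h * h)
    return H

# asymptotic SEs: se = sqrt(diag(inv(num_hessian(negloglik, theta_hat))))
```

When `f` is the negative log-likelihood evaluated at the MLE, the square roots of the diagonal of the inverse Hessian are the usual asymptotic standard errors.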

An EM-type algorithm
A direct maximization of the likelihoods (2.9) and (2.10) may pose problems, since they involve terms like log(Φ 1 (w)), which causes computational problems for very negative w (w < −40, for example). Further, the approach seems not to be robust with respect to starting values; that is, unless good starting values are used, direct maximization will typically not converge. The simulation studies we conducted indicate that the EM algorithm is more robust, in the sense that it converges more often than the direct maximization approach.
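The underflow problem with log(Φ 1 (w)) for very negative w can be seen, and avoided, directly: Φ 1 (w) underflows to zero in double precision long before w = −40, while a log-scale implementation of the normal cdf remains finite. A small illustration with SciPy:

```python
import numpy as np
from scipy.stats import norm

w = -50.0
with np.errstate(divide="ignore"):
    naive = np.log(norm.cdf(w))   # Phi(-50) underflows to 0, so log gives -inf
stable = norm.logcdf(w)           # log-scale evaluation stays finite (about -1254.8)
```

Using `norm.logcdf` (or an equivalent log-scale routine in Ox or Matlab) in the log-likelihood removes this failure mode without changing the objective.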
The EM algorithm (Dempster, Laird, and Rubin, 1977) is a popular iterative algorithm for ML estimation in models with incomplete data. More specifically, let y denote the observed data and t denote the missing data. The complete data y comp = (y, t) is y augmented with t. We denote by ℓ c (θ c ), θ c ∈ Θ, the complete-data log-likelihood function and by Q(θ c |θ (r) c ) the expected complete-data log-likelihood given the observed data and the current parameter value θ (r) c . Each iteration of the EM algorithm consists of two steps: the expectation (E) step, which computes Q(θ c |θ (r) c ), and the maximization (M) step, which maximizes it over θ c . Each iteration of the EM algorithm increases the likelihood function ℓ(θ c ), and the EM algorithm typically converges to a local or global maximum of the likelihood function. When the M-step is difficult to implement, it is useful to replace it with a sequence of constrained maximization (CM) steps, each of which maximizes Q(θ c |θ (r) c ) over θ c with some function of θ c held fixed. The sequence of CM-steps is such that the overall maximization is over the full parameter space. This leads to a simple extension of the EM algorithm, called the ECM algorithm (Meng and Rubin, 1993). In this work we implemented the ECM algorithm, which for simplicity will be referred to as the EM algorithm.
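The ECM scheme just described can be captured by a small generic driver: one E-step per cycle, then a sequence of CM-steps, stopping when the log-likelihood stabilizes. This is a structural sketch with our own interface (`e_step`, `cm_steps`, `loglik` are user-supplied callables), not the paper's specific update equations.

```python
def run_em(theta0, e_step, cm_steps, loglik, tol=1e-6, max_iter=500):
    """Generic ECM driver: one E-step followed by a cycle of CM-steps,
    iterated until the change in log-likelihood is negligible."""
    theta, ll = theta0, loglik(theta0)
    for _ in range(max_iter):
        stats = e_step(theta)          # E-step: expected complete-data quantities
        for cm in cm_steps:            # CM-steps: conditional maximizations
            theta = cm(theta, stats)
        ll_new = loglik(theta)         # monotone: ll_new >= ll up to round-off
        if abs(ll_new - ll) <= tol * (abs(ll) + tol):
            return theta, ll_new
        ll = ll_new
    return theta, ll
```

For the SNLMM, `stats` would carry the conditional moments of the latent half-normal factors, and the CM-steps would update β, the variance components and the skewness parameters in turn.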
In order to implement the two steps of the EM algorithm for maximizing the likelihoods in Corollary 2, we first need some additional results. The proofs are given in the appendix.
Proposition 2: Suppose that Y|T = t ∼ N n (µ + dt, Ψ) and T ∼ HN 1 (0, 1) (the standardized half-normal distribution). Let Σ = Ψ + dd T . Then the joint distribution of (Y T , T ) T can be written as in (4.1)-(4.2). Notice that the marginal distribution of Y follows from (4.1) after integrating out t.

Proposition 3: Under the conditions of Proposition 2, the conditional distribution of T given Y = y is that of X|X > 0, where X ∼ N 1 (η, τ 2 ), with η and τ 2 given in (4.2). In particular, the conditional moments E[T |y] and E[T 2 |y] follow from the moments of the truncated normal distribution.

EM algorithm when λ e1 = . . . = λ em = 0

Under this assumption we have the SNLMM (4.6)-(4.7). It is clear that (4.7), jointly with Proposition 1, implies the representation (4.8), where X 0j iid ∼ N 1 (0, 1) and X 1j iid ∼ N q (0, I q ), with X 0j and X 1j independent, j = 1, . . ., m. Moreover, independence between b j and ε j , j = 1, . . ., m, implies that V j = (X 0j , X T 1j ) T and ε j are independent, j = 1, . . ., m. Hence, replacing (4.8) in (4.6), we obtain the decomposition (4.9)-(4.10), whose terms are independent, j = 1, . . ., m. Note that the residual term r j has mean zero, so it can be used, for example, in residual analysis to check model adequacy. Besides, the second term on the right side of equation (4.9) has mean (2/π) 1/2 Z j δ b , which can be used to correct the model intercept so that the fixed effects have the same interpretation as in the usual LMM (population average).
Therefore, (4.9) and (4.10) imply that the model defined by (4.6)-(4.7) can be written as in (4.11), j = 1, . . ., m, with the quantities defined in (4.12). Note that in (4.12) µ j and Σ j are the marginal mean vector and covariance matrix, respectively, under the usual linear mixed model. Letting y = (y T 1 , . . ., y T m ) T and t = (t 1 , . . ., t m ) T , as a direct consequence of Proposition 2 and simple algebra we have the next result.
Proposition 4: Under (4.11), the complete-data log-likelihood function associated with (y, t) in the SNLMM (4.6)-(4.7) can be written as in (4.13), where η j and τ 2 j are given by (4.2) with µ j , d j , Σ j and Ψ j as defined in (4.12). Letting t j denote the conditional expectation of t j given y j , we obtain the required E-step quantities from Proposition 3, with η j and τ 2 j as in (4.14). We then have the following EM algorithm:
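Since Proposition 3 identifies T|Y = y as a N 1 (η, τ 2 ) variable truncated to (0, ∞), the E-step reduces to the first two moments of that truncated normal. A sketch of these moments (the function name is ours), computed with a numerically stable Mills ratio:

```python
import numpy as np
from scipy.stats import norm

def truncnorm_moments(eta, tau):
    """First two moments of X | X > 0 with X ~ N(eta, tau^2):
    E[T|y] = eta + tau*W, E[T^2|y] = eta^2 + tau^2 + eta*tau*W,
    where W = phi(eta/tau) / Phi(eta/tau)."""
    # exp(logpdf - logcdf) avoids 0/0 underflow for very negative eta/tau
    w = np.exp(norm.logpdf(eta / tau) - norm.logcdf(eta / tau))
    t1 = eta + tau * w
    t2 = eta ** 2 + tau ** 2 + eta * tau * w
    return t1, t2
```

With η j and τ j from (4.14), these are the quantities plugged into the CM-steps at each iteration.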

EM algorithm when λ b = 0
Likewise, consider the case λ b = 0, that is, the linear mixed model in (4.6) with the assumption that ε j ind ∼ SN n j (0, ψ j , λ ej ) and b j iid ∼ N q (0, D), j = 1, . . ., m, (4.19) all independent. We can then write the model hierarchically as in (4.20), with Y j |t j ∼ N n j (µ j + d j t j , Ψ j ) and t j ∼ HN 1 (0, 1), j = 1, . . ., m, (4.21) where the quantities involved are given in (4.22). As a consequence of the above results, it follows from Proposition 2 that:

Proposition 5: Under (4.21), the complete-data log-likelihood function associated with (y, t) in the SNLMM (4.6) and (4.19) can be written as in (4.23), with η j and τ 2 j given by (4.24) and µ j , d j , Σ j and Ψ j defined as in (4.22).
The EM algorithm in this case proceeds as in (4.15)-(4.18), with η j and τ 2 j as in (4.24). Note that in both cases the M-step requires numerical maximization, which can easily be implemented using statistical software such as Ox, R or Matlab, with the bfgs, optim and fmincon routines, respectively. The starting values are often chosen to be the corresponding estimates under the normal assumption, with the starting values for the asymmetry parameters set to 0; as recommended in the literature, it is useful to run the EM algorithm several times with different starting values.
Following other authors (e.g., Zhang and Davidian, 2001), we propose evaluating a series of fits by inspecting information criteria such as Akaike's Information Criterion (AIC, −ℓ( θ, λ)/N + P/N ), Schwarz's Bayesian Information Criterion (BIC, −ℓ( θ, λ)/N + 0.5 log(N )P/N ), and the Hannan-Quinn Criterion (HQ, −ℓ( θ, λ)/N + log(log(N ))P/N ), where P is the number of free parameters in the model and N = n 1 + . . . + n m , as we demonstrate in the next section.
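The three per-observation criteria defined above differ only in the penalty multiplier, so they are convenient to compute together (smaller is better); a direct transcription:

```python
import numpy as np

def information_criteria(loglik, P, N):
    """Per-observation AIC, BIC and HQ as defined in the text:
    -loglik/N plus a penalty proportional to P/N."""
    base = -loglik / N
    return {"AIC": base + P / N,
            "BIC": base + 0.5 * np.log(N) * P / N,
            "HQ":  base + np.log(np.log(N)) * P / N}
```

BIC penalizes free parameters most heavily once 0.5 log(N) exceeds 1 (N > e²), which is consistent with its more conservative behavior in the model-selection results reported later.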

Simulation Study
To assess the performance of the proposed model and methods, we conducted two simulation studies. In all cases, we took the linear mixed model to be (5.1), with i = 1, . . ., 5 observations per subject and e ij ∼ N (0, 0.5 2 ). In each simulation, 100 Monte Carlo data sets were simulated from (5.1) according to the additional specifications described below. Since an advantage of the skew-normal representation for the random effects is its propensity for accommodating skewness, in the first simulation we generated α + b j according to a Gamma(4, 1) distribution, with probability density (1/6)x 3 exp(−x), yielding a highly skewed distribution, as suggested by Figure 2. Under this specification, E[α + b j ] = 4 and V ar[α + b j ] = 4. Note that t ij represents a covariate whose values change within individuals and are the same for all individuals, while w j is an individual-level covariate, e.g., a treatment indicator. We took m = 200, with w j = 1 if j ≤ 100 and w j = 0 if j > 100. Each of the 100 Monte Carlo data sets generated from (5.1) was fitted twice under the assumptions of the previous section: with the density of b j represented by the skew-normal distribution and by the normal distribution. To evaluate the objective use of the criteria, the model preferred by each of AIC, BIC and HQ was recorded.
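A data-generating sketch in the spirit of this design follows. The covariate values and the slopes `beta1`, `beta2` are illustrative placeholders, since the exact form of (5.1) is not recoverable from the text; the Gamma(4, 1) random intercept, the N(0, 0.5²) errors, m = 200 and the treatment split are taken from the description above.

```python
import numpy as np

rng = np.random.default_rng(2024)
m, n_i = 200, 5
t = np.arange(1, n_i + 1) / 10.0          # within-subject covariate (illustrative values)
w = (np.arange(m) < 100).astype(float)    # treatment indicator: first 100 subjects
beta1, beta2 = 1.0, 0.5                   # illustrative slopes, not the paper's true values

ab = rng.gamma(shape=4.0, scale=1.0, size=m)   # alpha + b_j ~ Gamma(4,1): mean 4, var 4
e = rng.normal(0.0, 0.5, size=(m, n_i))        # within-subject errors
y = ab[:, None] + beta1 * t[None, :] + beta2 * w[:, None] + e
```

Each simulated data set of this kind would then be fitted twice, once under the skew-normal and once under the normal specification for b j , and the criterion-preferred model recorded.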

An Application
We illustrate the usefulness of the proposed methods by applying them to longitudinal data on cholesterol levels collected as part of the famed Framingham heart study. The data include the cholesterol levels over time, age at baseline and gender for m = 200 randomly selected individuals, as reported in Zhang and Davidian (2001). We adopt the same linear mixed model used by these authors, in which y ij is the cholesterol level divided by 100 at the i-th time for subject j and t ij is (time − 5)/10, with time measured in years from baseline; age j is age at baseline; sex j is the gender indicator (0 = female, 1 = male). Figure 3(a) shows the histogram of cholesterol levels, clearly indicating their asymmetric nature and suggesting that a skew-normal model is adequate for this data set. Zhang and Davidian (2001) analyzed these data and showed that the asymmetric behavior is only partially explained by the available covariates and that the random effects may not be normally distributed. Based on this information, three statistical models, differing in the distributions of the error term and the random effects, are entertained. These models are: Model 1, with independent multivariate normal distributions for the errors and a multivariate skew-normal distribution for the random effects, with λ b = (λ b1 , λ b2 ) T ; Model 2, with independent multivariate skew-normal distributions for the random errors, with a common shape parameter between groups, and a multivariate normal distribution for the random effects; and Model 3, a purely Gaussian model.
For the EM algorithm, none of AIC, BIC, or HQ selected the normal specification for any of the 100 data sets, demonstrating the ability of these selection methods to detect an obvious departure from normality and suggesting strong evidence of skewness. Table 1 shows the numerical results for the estimates when normality and skew-normality were considered. For the most part, parameter estimates are unbiased. In most cases, the average of the estimated standard errors agrees well with the Monte Carlo standard deviation. As found by other authors (e.g., Tao et al., 1999; Zhang and Davidian, 2001), efficiency in the estimation of β 2 , the coefficient associated with the individual-level covariate w j , is degraded when normality is assumed, relative to allowing a more flexible representation via the skew-normal distribution. Because one of the main focuses of such an analysis may well be the evaluation of a treatment effect, this suggests that routinely adopting the normality assumption may lead to inefficient inferences on fixed effects of primary interest. Note that some efficiency loss is also associated with estimation of the inter-individual variance V ar[b j ]. Thus, although unbiased estimation is still possible under normality, failure to take appropriate account of the true features of the random effects leads to less precise inference on what are usually quantities of key interest.
The advantage of estimating the random effects density may be appreciated from Figure 2. Figure 2(a) shows the Monte Carlo average of the density estimates over the 100 data sets, along with the true density, the symmetric normal fit and the skew-normal fit. The figure demonstrates that the additional flexibility afforded by the skew-normal representation is sufficient to capture quite accurately the true underlying features of the random effects. This observation is further supported by Figure 2(b), which shows the 100 density estimates from the skew-normal fits. The second simulation was identical to the first except that the true distribution of α + b j was instead N (4, 4), that is, α = 4 and b j ∼ N (0, 4). Here, then, there is no need for greater flexibility in the random effects, and the hope is that the proposed methods identify this. The AIC, BIC and HQ criteria correctly selected the normal distribution 86%, 98%, and 93% of the time, respectively. Summaries of the Monte Carlo results are given in Table 2, which shows that there is no efficiency loss associated with using the skew-normal distribution. The apparent conclusion is that the price to pay for estimating the random effects density when the normality assumption holds is mild; similar results are reported in Hu, Tsiatis, and Davidian (1998) and Zhang and Davidian (2001).
In all cases we considered ψ j = σ 2 e I n j , j = 1, . . ., 200 (conditional independence). Table 3 presents the results obtained using the EM-type algorithm for the three models described above; SE denotes the estimated asymptotic standard errors based on the Hessian matrix, computed numerically. Under Model 2 (only the random errors asymmetrically distributed), asymmetry is not detected, and the parameter estimates are close to those obtained under normality (Model 3). The AIC, BIC and HQ criteria indicate that Model 1 presents the best fit, supporting the contention of a departure from normality. Estimates of the individual-level covariate coefficients β 1 , β 2 and β 3 differ as well, reflecting the qualitative behavior seen in the simulations. Figure 3(b)-(d) shows the estimated random effects distribution, whose skewness is particularly evident in the contour plot shown in Figure 3(b). Figures 3(c) and (d) show the estimated marginal densities of b 0j and b 1j , respectively. Note that the distribution of the slopes appears normal, while the shape of the density of the intercepts shows evidence of skewness, as Zhang and Davidian (2001) concluded using the SNP representation.

Final Conclusion
In this paper we developed a skew-normal mixed model for fitting regression models to dependent data. We believe that this is the first attempt at working with such a general distributional structure for mixed models, and the approach used in this paper can be applied to other multivariate models, which will be the subject of forthcoming papers. An analytical (closed-form) expression is obtained for the marginal distribution of the observed response vector, which allows inference to be carried out using standard optimization techniques and existing statistical software. For computing the MLE, an EM-type algorithm is developed by exploiting statistical properties of the model considered; as is typical for the EM algorithm, reliability rather than speed is its best feature. A small simulation study is also presented where, as observed in other contexts and approaches (e.g., Zhang and Davidian, 2001), there is potential to gain efficiency in estimating certain parameters when the normality assumption does not hold, with only a small price to pay for the extra complications in the assumptions. An additional major advantage of all approaches that relax the assumptions on the densities of the random effects and the model errors is the insight the estimates provide.
We have implemented the approach using MATLAB with the fmincon optimizer routine; code is available from the authors upon request.

Figure 2 :
Figure 2: Simulation results based on 100 data sets. (a) True density (solid line) and Monte Carlo average of estimated densities over the 100 data sets: normal (dash-dotted) and skew-normal (dotted). (b) Estimated densities from the skew-normal fits.

Figure 3 :
Figure 3: (a) Histogram of cholesterol levels for the 200 subjects of the Framingham cholesterol study. (b) Contour plot of the estimated density of b j . (c) and (d) Corresponding estimated marginal densities for the components of b j (solid) with the estimated normal density (dotted) superimposed.

Table 1 :
Monte Carlo results based on 100 data sets, true Gamma(4, 1) distribution for the random effects. MC Mean and MC SD are the average and standard deviation of the estimates; AVE SE is the average of the estimated standard errors; EC is the empirical coverage probability of the 95% confidence intervals. True parameter values are in parentheses.

Table 2 :
Monte Carlo results based on 100 data sets, true Normal(0, 4) distribution for the random effects. Entries are as in Table 1.

Table 3 :
Results of fitting Models 1, 2 and 3 to the cholesterol data. d 11 , d 12 and d 22 are the distinct elements of the matrix D 1/2 .