Variable Selection in the Chlamydia Pneumoniae Lung Infection Study

In this study, the data based on nucleic acid amplification techniques (Polymerase chain reaction) consisting of 23 different transcript variables which are involved to investigate genetic mechanism regulating chlamydial infection disease by measuring two different outcomes of muring C. pneumonia lung infection (disease expressed as lung weight increase and C. pneumonia load in the lung), have been analyzed. A model with fewer reduced transcript variables of interests at early infection stage has been obtained by using some of the traditional (stepwise regression, partial least squares regression (PLS)) and modern variable selection methods (least absolute shrinkage and selection operator (LASSO), forward stagewise regression and least angle regression (LARS)). Through these variable selection methods, the variables of interest are selected to investigate the genetic mechanisms that determine the outcomes of chlamydial lung infection. The transcript variables Tim3, GATA3, Lacf, Arg2 (X4, X5, X8 and X13) are being detected as the main variables of interest to study the C. pneumonia disease (lung weight increase) or C. pneumonia lung load outcomes. Models including these key variables may provide possible answers to the problem of molecular mechanisms of chlamydial pathogenesis.


Introduction
Chlamydophila pneumonia (C.pneumonia), an obligate intracellular bacterial pathogen, is the most common chlamydial pathogen that causes communityacquired respiratory infections.Although the infection is typically mild acute, it is strongly associated with atherosclerotic coronary heart diseases.C. pneumonia infection may also lead to chronic, persistent infections with several possible disease outcomes such as chronic inflammatory diseases of presumably noninfectious etiology (Cannon et al., 2005;Wang, 2005;Frikha-Gargouri et al., 2008).
The complex interaction between chlamydial replication and host response determines the outcome of the infection not by a single all-influencing factor but rather by a series of accumulated host-and pathogen-associated factors.Also, human studies indicated a major influence of host genetics on disease severity following chlamydial infection (Wang et al., 2008).
Since C. pneumonia can lead to severe clinical disease, correct diagnosis and therapy are important issues.However, conventional assays for the detection of C. pneumonia have limitations, and there is a need for more accurate, convenient and rapid diagnostic methods.Nucleic acid amplification techniques (such as polymerase chain reaction) have such a potential to offer clinical laboratories a convenient means to detect C. pneumonia and ensure optimal, timely and appropriate clinical decisions and patient care (Boman et al., 1999).In this study, the researchers choose 23 different transcript variables measured at a molecular level based on the modified polymerase chain reaction technique.These transcript variables are the key markers of the immune and inflammatory response and play pivotal roles in the regulation of the immune response to chlamydial infection or play a key role in a protective host response to chlamydia infection (Wang et al., 2008).Usually, the evaluations of protection from this disease are determined by survival of mice and lung weight increase, as well as elimination of C. pneumonia organisms by determination of total chlamydial lung loads.The lung weight increase is a reliable measure of disease intensity, and high increases reflect severe disease (Li et al., 2010).Therefore, in our study these two indexes, percent lung weight increase (based on naïve lung weights of 138.4 mg for adult female A/J mice) and the logarithm of total C. pneumonia lung loads are treated as the dependent variables which are the observed results of these 23 transcript variables being manipulated.
In this study, disease intensity and chlamydial lung loads were determined early after rechallenge, i.e., on day 3 and later on day 10 at peak disease, and then mice were sacrificed by CO 2 inhalation 3 days or 10 days after inoculation, and lungs were weighed, snap frozen in liquid nitrogen, and stored at −80 • C until further processing to get the value of the 23 parameters.Therefore, the whole experiment consisted of 16 different groups of each two time points (day 3 and day 10), comprising a total of 320 female mice, i.e., each group contains 10 female mice.The dependent variables are Y 1 lung disease (lung weight increase %) and Y 2 lung C. pneumonia load (i.e., chlamydial lung burden), defined as average percentage of lung weight increase and the log 10 C. pneumonia / lung, respectively.The independent variables are the 23 different transcript variables, which were log 2 transformed.
Variable selection in multivariate analysis is a very relevant step, because the removal of non-informative variables will produce better predicting and sim-pler models and at the same time maintain almost essentially all the information provided by the original set of regressors.The technique can be used to gain a better understanding of the regression relationship through a simplified description of it, or to reduce the number of regressors required for effective predication of the regressand.In either case, further studies involving such regression relationship will be easier and less expensive to carry out if fewer variables are involved.Goals in variable selection methods include: accurate predictions, interpret models−determining which predictors are meaningful, stability−small changes in the data should not result in large changes in the subset of predictors used, the associated coefficients, or the predictions, and avoiding bias in hypothesis tests during or after variable selection.Traditional methods, such as stepwise regression, all-subsets regression, ridge regression, principal component and partial least squares based methods fall short in one or more of these criteria.Modern procedures such as boosting (Freund and Schapire, 1997) forward stagewise regression (Hastie et al., 2007), and LASSO, least absolute shrinkage and selection operator (Tibshirani, 1996), LARS, least angle regression (Efron et al., 2004) improve stability and predictions.
The objectives of this study are to select the multivariate model candidates based on a few well-known selection methods and criteria, and to construct a simple model which can efficiently predict the late disease outcomes with high accuracy.
Descriptions of the variable selection methods employed in this study are given in Section 2. Results of the analysis and conclusion are provided in Sections 3 and 4.

Stepwise Regression Based Variable Selection Methods
One common approach to select a subset of variables from a complex model is stepwise regression.A stepwise regression is a procedure to examine the impacts of each variable on the outcomes step by step.The variables that do not contribute much to the variance explained would be removed.There are several versions of stepwise regression such as forward selection, backward elimination, and stepwise (Al-Subaihi, 2002).
The stepwise regression has its own limitations.When the number of variables is large compared to the number of observations in the data set or a multicollinearity problem is detected among the variables, the stepwise algorithm may not function or end up with throwing nearly all the variables into or out from the model, especially at a low F -to-enter or F -to-remove threshold.

PLS Based Variable Selection Method
The standard multiple regression model defined by the equation where X is a n×p matrix of explanatory variables (predictors), y is a n×1 vector of response variable, β is a p × 1 vector of unknown parameters, and ε is a n × 1 vector of error terms whose rows are identically and independently distributed.
The ordinary least squares (OLS) estimator of β, b OLS , in the model given by (2.1) is the solution of the following optimization problem: In many applications of multiple regression, multicollinearity is inevitable as a result of large number of variables collected by modern technologies of computers, networks, and sensors.Despite having desirable properties, the OLS estimator can have an extremely large variance and results in imprecise prediction when the data are multicollinear.Moreover, solution of (2.2) is not unique when n < p.
Recently, partial least squares (PLS) has become an important statistical tool for modeling relations between sets of observed variables by means of latent variables, which have the "best" predictive power, especially for statistical problems dealing with high dimensional data sets.PLS is a member of nonlinear iterative least squares (NILES) procedures developed by Wold (1966Wold ( , 1975)).In order to deal with multicollinearity and/or dimensionality problem, we regress the response variable y on a subset of the k orthogonal (latent) vectors stored in a score matrix of size n × k by which important features of X have been retained.Score matrix is formed by taking linear combinations of columns of X.
PLS regression constructs the columns of score matrix, So, PLSR balances the maximal correlation criteria for OLS given in (2.1) with the requirement of explaining as much as variability in both X and y -space.
PLS is a distribution free approach to regression and path modeling, robust against other data structural problems such as skew distributions and omission of regressors.It is a useful tool to reveal a few underlying predictive factors that account for most of the variation in the response (Geladi, 2005;Cassel et al., 1999).
Usually to explore further which predictors can be eliminated from the analysis, the regression coefficients for the standardized data are evaluated.Predictors with small absolute coefficients making a small contribution to the response can be eliminated from the model.Another statistic summarizing the contribution of a variable to the model is the variable importance for projection (VIP), which estimates the relative importance of each independent variable X in fitting both predictors and responses and thus is often used for variable selection (Abudu et al., 2010).
The VIP scores are defined as (2.3) It is assumed that there are k latent variables selected from p predictor variables (X j ).A regression model based on PLS was built using the latent vectors obtained from X to predict Y .As shown in (2.3), W ij represents the loading vector between the ith latent variable and the independent variable X j .SS i implies the response variance explained by the ith latent variable when a PLS model is developed.The ratio of the variance explained by X j to the total variance implies the relative influence of each predictor variable on the total variance The VIP score is obtained when the number of the predictor variables is multiplied by the influence of each predictor variable (Han and Kim, 2003).Therefore the VIP coefficients reflect the relative importance of each predictor variable in fitting both predictors and responses.The larger a VIP score, the greater the contribution of the associated predictor variable provides to the PLS model (Han and Kim, 2003).
Predictors with relatively small VIP coefficients (in absolute value), less than 1 (one), are considered to have small contribution to the prediction and might be excluded from the model; predictors with VIP scores close to or greater than 1 (one) can be considered the most relevant for explaining Y .This is called "greater than one rule", and is used as the criterion for variable selection (Han and Kim, 2003;Chong and Jun, 2005).

The LASSO, Forward Stagewise and LARS Methods
There are two main reasons why the data analyst is not satisfied with the OLS estimates based variable selection methods: 1. prediction accuracy, the OLS estimator often is best linear unbiased but with high variance.Better prediction performance can be achieved by sacrificing a little bias.Because the small increase in bias can be traded by a larger reduction in variance, resulting in a more desirable estimator overall.2. interpretation; analysts often wish to determine smaller subset of large number of predictors, which has the strongest effect.Subset selection can be extremely variable, important repressors may drop in any step, small changes in data produce different models; therefore it destroys the prediction accuracy.Shrinking and setting some coefficients to zero, help us to have prediction accuracy or small subset of predictors.Although ridge regression, principal component and PLS based methods provide more stable models, they do not set any coefficients to zero, and does not provide us interpretable models.
LASSO proposed by Tibshirani (1996) is a popular technique for model selection and estimation in linear regression models.It employs an L1-type penalty on the regression coefficients which tends to produce sparse models, and thus is often used as a variable selection tool as in Tibshirani (1997) and Osborne et al. (2000).Knight and Fu (2000) studied the asymptotic properties of LASSOtype estimators.They showed that under appropriate conditions, the LASSO estimators are consistent for estimating the regression coefficients.It has been demonstrated in Tibshirani (1996) that the LASSO is more stable and accurate than traditional variable selection methods such as best subset selection.Efron et al. (2004) proposed the LARS, and showed that there is a close connection between the LARS, the LASSO, and the Forward Stagewise regression.Each of these procedures involves a tuning parameter that is chosen to minimize the prediction error.
Consider the common Gaussian linear regression model.LASSO estimate is the solution to min where t ≥ 0 is a tuning parameter.An alternative formulation of the LASSO is to solve the penalized likelihood problem Both formulations are equivalent in the sense that, for any given λ ∈ [0; ∞), there exists a t ≥ 0 such that the two problems have the same solution, and vice versa.Tuning parameter can be chosen based on Mallows' C p , AIC and BIC.The LASSO is a constraint version of the OLS estimates.It shrinks some coefficients and sets rest of them to zero, also continues with the useful properties of both subset selection and ridge regression.The forward stagewise regression is an iterative procedure, where successive estimates are built via a series of small steps.Letting η 0 = Xβ, and beginning with η0 = 0 is the current estimate, the next step is taken in the direction of the greatest correlation between covariate X j and the current residual.
The LARS is a new model selection algorithm which is a useful and less greedy version of traditional forward selection or forward stepwise regression.It uses mathematical formula to speed up the computations.The number of the covariates is as the same as the number of steps that are required for the full set of solutions.The LARS algorithm starts with all coefficients equal to zero, then finds the predictor that has the largest correlation with the response variable, say X j1 and increases the coefficient in the direction of the sign of its correlation with the response.Next it finds residuals, stops when any other predictor, say X j2 has as much correlation with the residual computed for the first coefficient as does X j1 .The LARS proceeds in a direction equiangular between these coefficients X j1 and X j2 until the third variable X j3 , has the most correlation with the residual.Then LARS proceeds equiangular between these three coefficients, which is the "least angle direction" until the fourth variable enters the model.LARS procedure continues until all variables are in the model.Efron et al. (2004) showed that there is a close relationship among these procedures in that they give almost identical solution paths.

Analysis of Results
As we mentioned earlier, there is a need for more efficient and timely methods to diagnose such disease.Therefore it would be beneficial to take an action at early stage of the disease if model could be constructed to predict the late disease outcomes on day 10 (i.e., lung weight increase and C. pneumonia load) by using transcript variables from the average measurements in 16 groups in day 3.
There are two groups in day 3 and day 10 and two responses of different disease outcomes, lung weight increase % (Y 1 ) and lung C. pneumonia load (i.e., chlamydial lung burden) (Y 2 ) defined as average percentage of lung weight increase (Y 1 ) and the log 10 C. pneumonia / lung (Y 2 ), respectively.It is worth noting that these two responses measured on day 10 have no significant correlation to each other (p-value = 0.8238).
According to the result of the Hotelling's T -square test to test if there was a difference between the mean vector of two responses in day 3 and day 10 (Wilk's Λ = 0.83280946, p-value = 0.0705), the late disease outcomes can be predicted from the early transcript variables, which means a model can be constructed by using day 3 transcript variables to predict the late disease outcomes on day 10.Therefore, in our study, the 23 independent variables are chosen from day 3 and both dependent variables stand for the values on day 10.As a result of this, we are interested in determining which early transcript variables on day 3 would be contributing to the prediction of the average late disease outcomes (i.e., the means of lung weight increase on Day 10 and lung C. pneumonia load on Day 10, respectively).We used PROC GLMSELECT in SAS 9.2 for the results of stepwise, LASSO and LARS methods.

Variable Selection Results Based on Stepwise Regression Methods
The stepwise regression was attempted in this study, transcript variables X 4 , X 5 and X 13 were selected to construct a model to predict Y 1 on day 10 and variables X 5 , X 8 , X 14 were selected to build model to predict Y 2 on day 10 (Table 1a and 1b).
In a forward selection, with specified the number of candidate effect that was entered in the model sequentially, specific subset was determined.For example, if the number of selection steps was set at 7 or 8, variables X 1 , X 2 , X 3 , X 4 , X 5 , X 13 and X 21 were selected to construct model to predict Y 1 on day 10.Similarly, variables X 2 , X 5 , X 8 , X 14 , X 17 and X 19 were selected to build model to predict Y 2 on day 10.More variable selection combinations are shown in Table 1a and  1b.However, if the number was not explicitly specified, a "best" subset model will be determined by the Schwarz Bayesian Information Criterion (SBC).Add effects that give the lowest value of the SBC statistic at each step and stop at the step where adding any effect would increase the SBC statistic.Therefore, variables X 4 , X 5 and X 13 were selected to construct model to predict Y 1 on day 10 and variables X 5 , X 8 , X 14 were selected to build model to predict Y 2 on day 10.
Stepwise Forward # LASSO & LARS *# PLS Symbols: X: Include all and find the best, *: The best model with 5 or 6 variables, #: The best model with 7 or 8 variables, S: PLS.
Symbols: X: Include all and find the best, *: The best model with 5 or 6 variables, #: The best model with 7 or 8 variables, S: PLS.

Variable Selection Results Based on PLS Method
According to "greater than 1 rule", the variables X 1 -X 6 , X 9 and X 12 -X 23 with small VIP scores in the PLS model 1 are excluded and the variables X 7 , X 8 , X 10 , X 11 , and X 13 are selected to build model 1 to predict Y 1 (Figure 1(a)).According to the VIP scores plot (Figure 1(b)), variables X 1 -X 3 , X 6 , X 8 , X 10 , X 11 , X 13 , X 15 -X 18 and X 20 -X 22 have small VIP values in the PLS model 2 and variables X 4 , X 5 , X 7 , X 9 , X 12 , X 14 , X 19 and X 23 are selected to construct model 2 to predict Y 2 .

Variable Selection Results Based on LASSO Method
Since we obtained identical models for the forward stagewise and LARS, we will provide the results from the LASSO method.If the number of selection steps was set at 7 or 8, variables X 1 , X 3 , X 4 , X 6 , X 8 , X 13 , X 16 and X 20 were selected to construct model to predict Y 1 on day 10.Similarly, variables X 2 , X 5 , X 7 , X 8 , X 9 , X 12 , X 14 and X 16 were selected to build a model to predict Y 2 on day 10.Without the specified the number of selection steps, X 8 and X 13 were chosen in model Y 1 and X 14 was selected in model Y 2 .More variable selection combinations are shown in Table 1a and 1b

Summary Results of the Variable Selection Methods
Utilizing statistical methods like stepwise regression, LASSO and PLS-VIP, several variables have been selected with specified different selection and stopping criteria.Table 1a and 1b summarizes different combination of variables selected from stepwise regression, PLS method in both model 1 and model 2 with specified number of candidate effects or selection steps.The four variables X 4 , X 5 , X 8 , X 13 are the "most" selected variables from these methods to construct model Y 1 , and the four variables X 2 , X 5 , X 8 , X 14 are the "most" selected variables from these methods to construct model Y 2 .We have also detected one outlier (7 th observation) with moderately high residual (studentized residual = 2.985) based on the fitted model of Y 1 vs. X 4 , X 5 , X 8 , X 13 .However this observation did not affect the results of the statistical analysis, therefore we included this observation in our data.
The value of various all-possible-regression selection criteria are shown in the Table 2 with these variables and the "Criterion Panel" in Figure 2 provides a graphical view of the evolution of these fit criteria as the "most" selected variables entered in the model sequentially.Good linear models are achieved with small values of AIC (Akaike's criterion, Akaike, 1969, Darlington, 1968), SBC (Schwarz's Bayesian criterion, Schwarz, 1978), BIC (Sawa Bayesian information criterion, Sawa, 1978), AICC (a small sample bias corrected version of AIC, Hurvich and Tsai, 1989) and C p (Mallows, 1973) and a high adjusted R 2 value (Darlington, 1968) close to 1.

Conclusions
In this article, various variable selection methods such as stepwise regression, forward selection, PLS and LASSO were employed to fulfill the preselection step.At this step, for example, 23 descriptors were reduced to 3 by stepwise regression and forward selection, 2 by LASSO without specified number of variable effects, and 5 by PLS-VIP in model 1.These preselected variables served as a starting pool for the comparison of variable selection methods.After the step of preselection, the "most" variables are selected from these methods to construct models and good linear models are achieved with small AIC, SBC, BIC, AICC and C p and a high adjusted R 2 value close to 1.
Although stepwise regression methods are often used for variable selection due to their simplicity, it cannot overcome the over-fitting in our case because of the variables outnumbering the observations and high correlation among the predictor variables.Besides, forward regression method yielded unsatisfactory results due to rank deficiency problem.To overcome the multicollinearity problem, a possible solution is to use only a subset of the predictor variables, where the subset is chosen so that it does not contain multicollinearity problem.Numerous subset selection methods are available.In this paper, PLS-VIP method, LASSO and LARS were employed to circumvent such problem.Utilizing the statistical methods like stepwise regression, PLS-VIP, several variables were preselected and served as a starting pool.Simple model candidates were generated based on the preselected variable pool by employing several well-known model selection criteria in multiple regressions.The researchers of this study are concerned about the primary role of cellular markers and inflammatory regulators in C. pneumonia disease regulation.Transcript variables Tim3, GATA3, Lacf, Arg2 (X 4 , X 5 , X 8 and X 13 ) are the main variables of interest to study the C. pneumonia disease (lung weight increase) or C. pneumonia lung load outcomes.These excessive CD4 + T helper cells immunity (Tim3, GATA3), macrophage activation (Arg2) and an exaggerated contribution of innate immunity of the early response of C. pneumonia (Lacf) are the key determinants in precipitation of late disease.IL-6 (X 14 ) as a critical regulator for amplifying inflammation had an enhancing effect on chlamydial growth in vitro.Studies also showed that blockade of IL-6 trans-signaling corrects hyperinflammation and increases chlamydial load in Dusp1 −/− mice (Rodriguez et al., 2005(Rodriguez et al., , 2010)).As a hallmark of B-cells, CD19 (X 2 ) promotes the proliferation and survival of mature B cells.The studies show that B cells play an important role in the initiation of T cell responses to Chlamydia trachomatis (Mouse Pneumonitis) lung infection and pulmonary chlamydial infection and related to impaired cytokine production (Ramsey et al., 1988;Yang et al., 1998).Mutations in CD19 are associated with severe immunodeficiency syndromes characterized by diminished antibody production.CD19 has been used to diagnose cancers that arise from this type of cell notably B-cell lymphomas and also been implicated in autoimmune diseases and may be a useful treatment target (van Zelm et al., 2006(van Zelm et al., , 1995;;Fujimoto et al., 2007).However, there is no direct evidence shown that CD19 has a strong impact on chlamydial infection.The role of CD19 in the chlamydial growth is less well understood.
Models including these key variables may provide possible answers to the problem of molecular mechanisms of chlamydial pathogenesis.It is worth noting that variable GATA3 (X 5 ), which is necessary and sufficient for Th2 cytokine gene expression in CD4 T cells, is the most frequent variable selected from multiple statistical methods, suggested a primary role in immune responses to regulate the C. pneumonia disease.This result is consistent with the studies that Th2 immunity is a required component to control inflammation elicited by the Th1 component of the anti-chlamydial immune response (Wang, 2005).
Therefore, a better understanding of the regression relationship can be reached through a reduction of the number of regressors required for effective predication of the regressand.Since fewer variables are involved, further studies involving such regression relationship may be easier and less expensive to perform and researches can better focus on the variables of interest.

. 21 Figure 1 :
Figure 1: VIP scores plot of PLS model for Y 1 (a) and Y 2 (b)

Figure 1 :
Figure 1: VIP scores plot of PLS model for Y 1 (a) and Y 2 (b)

Table 1a :
Summary of variable selection from multiple statistical methods in model Y 1

Table 1b :
Summary of variable selection from multiple statistical methods in model Y 2