A Bimodal Spike and Slab Model for Variable Selection and Model Exploration

: We have developed an enhanced spike and slab model for variable selection in linear regression models via restricted ﬁnal prediction error (FPE) criteria; classic examples of which are AIC and BIC. Based on our proposed Bayesian hierarchical model, a Gibbs sampler is developed to sample models. The special structure of the prior enforces a unique mapping between sampling a model and calculating constrained ordinary least squares estimates for that model, which helps to formulate the restricted FPE criteria. Empirical comparisons are done to the lasso, adaptive lasso and relaxed lasso; followed by a real life data example.


Introduction
We consider the problem of selecting variables in a linear regression model.To outline the problem mathematically, we consider the linear regression model where the responses Y 1 , • • •, Y n are independent with corresponding K-dimensional predictors x 1 , • • • , x n .The {ε i }'s are independent variables such that E(ε i ) = 0 and E(ε 2 i ) = σ 2 > 0. The dilemma is to find the subset of nonzero covariates from β = (β 1 , β 2 , • • • , β K ) t .In this article it is also assumed that x i are standardized so that i=1 x i,k = 0 and i=1 x 2 i,k = n for each k.The purpose of this article is to use final prediction error (FPE) (Akaike, 1969) criteria for variable selection via a restricted model search mechanism based on Bayesian hierarchical model.
FPE is a criteria which takes the residual sum of squares (RSS) and tacks on a penalty related to the number of variables.To properly define FPE for a subset α ⊆ {1, • • • , K}, let β α be the components of β = (β 1 , • • • , β K ) t which are indexed by the elements of α.The formal definition of FPE criteria based on the linear regression model ( 1) is: (2) In (2), RSS(α) is the RSS for model α when the first K α variables are entered in the model, K α is the size of the model α, σ2 is the unbiased estimator for σ 2 and λ n is the non-zero penalty associated with the model.Model selection based on FPE chooses the model of size K α which minimizes (2) with respect to α. Classic examples of the FPE criteria are AIC (Akaike, 1973) and BIC (Schwarz, 1978).Note that, expression (2), setting λ n = 2 generates the AIC criterion and the BIC criterion if λ n = log(n).
In general to implement FPE variable selection approach; we need to visit all 2 K models and then apply the criteria to select the best subset of variables.Unfortunately this is not viable in high dimensional setting as it becomes computationally infeasible even for relatively small K. Instead we use a Bayesian approach based on rescaled spike and slab models (Ishwaran andRao, 2003, 2005 a, b); a variable selection method for linear regression models, which is rooted in the spike and slab models (George and McCulloch, 1993).Using this model we design a Gibbs sampler which allow us to draw values from the Bayesian posterior.This Gibbs sampler is highly efficient and is capable of effective search across the relevant model space which results in a restricted all subset search.Using these models we can implement a restricted FPE variable selection technique.
Another goal is to explore model space based on Bayesian hierarchical models.Model space exploration is extremely important from a machine learning perspective, especially in high dimensional setting.Therefore without executing any type of variable selection technique, we are able to locate variables that are vital in explaining the data.
The restricted FPE variable selection approach is solely driven by the proposed Bayesian hierarchical models; therefore, it is vital to study the impact of this model for searching the entire model space.By model space exploration we mean the following: the Gibbs sampler samples posterior values in each draw to produce desired model for performing restricted FPE analysis.These values also drive the posterior model probability, a higher probability value results in a larger model and vice versa.If Gibbs sampler samples only smaller models or only larger models, then resulting estimator will be underestimated or overestimated respectively.To analyze this issue, we theoretically study the performance of the proposed model in exploring the model space on the basis of Bayesian model averaged (BMA) estimator.Theoretical results show that Gibbs sampler based on our proposed model; samples all possible model sizes required for exploring the model space.Graphical tools are useful especially in high dimensional setting, to pinpoint the significance of variables in a given data set.We present a graphical tool to do so.
The article is organized as follows: Section 2 reviews the spike and slab hierarchical models and proposes a new bimodal prior in the spike and slab hierarchy.Also discussed is how this special type of hypervariance structure of the prior helps to compute FPE estimates for several models.Section 3 discusses the interplay between the posterior mean which is the BMA estimator and model space exploration.Section 4 introduces rescaled spike and slab models (termed the "bimodal spike and slab model").The remainder of this Section talks about how this model overcomes certain limitations of earlier proposed model in Section 2. In Section 5; a simulation study compares our proposed method with other popular model selection methods from the perspective of model space exploration; another simulation study is done to study the performance of the proposed model in the context of high-dimensional sparse linear model.Section 6 has real life data analysis using the R package modelSampler, developed based on our proposed model.The article concludes in Section 7. The proof of all lemmas and theorems of this article are placed in the Appendix.

A New Spike and Slab Model
Spike and slab models (Ishwaran and Rao, 2005b;George andMcCulloch, 1993, 1997;Mitchell and Beauchamp, 1988) are a popular methodology for selecting variables in the regression set up in (1).A spike and slab model defined in Ishwaran and Rao (2005b) is a Bayesian model specified by the following prior hierarchy: ( Here 0 is the K-dimensional zero vector, Γ is the All through this article the prior π for γ is specified as: where δ v (•) is a discrete measure concentrated at the value v.This prior for γ is in essence a two-component mixture model with one component taking on a very small value v 0 > 0, and the other component taking on a very large value V > 0. It induces a prior for β, which is a mix of near-degenerate multinormal distributions concentrated on a submodel α.An important feature of this two-component prior is that it has a selective shrinkage property.Selective shrinkage is a property where the posterior mean shrinks towards zero only for coefficients that are truly zero.For theoretical justification of this feature for a two-component prior see Ishwaran and Rao (2011).Another important feature of (4) is the presence of w, which we label as "complexity parameter", that controls the size of the models.Note that using an indifference prior is equivalent to choosing a degenerate prior for w at the value of 1/2.Using a continuous prior for w, therefore allows for a greater amount of adaptiveness in estimating model size.
As stated earlier; the purpose of this article is to explore model and at the same time; perform a restricted FPE analysis for variable selection.The implementation of the FPE analysis is done using the following procedure.
Running the Gibbs sampler, we track the different α models as they are being sampled.The model α is identified by introducing the following notation: Defining Thus, each draw for γ has an associated binary K-tuple (I 1 , • • • , I K ).Therefore the model α is the model where α = {k : I k = 1}.This creates a unique mapping between γ and α.Once we have the model α, we calculate the constrained OLS estimates of that model based on the indices of α, and then calculate the RSS for the model.Computation of AIC and BIC values are carried out using (2), which allows us to study the empirical performance of methods like AIC and BIC without being restricted to the number of predictors.This is greatly beneficial since classical implementations of an all subsets search (such as leaps-and-bounds (Furnival and Wilson, 1974)) is typically restricted to a handful of predictors, such as K = 30.The steps for calculating the FPE criteria are as follows.If the Gibbs sampler visits a model α, we calculate the constrained OLS of model α as where +α indicates the set of non-zero elements in α and I +α is an identity matrix of order (+α × +α).The corresponding RSS of that model will be RSS(α) = ||Y − X +α β+α || 2 .Once we have the RSS(α), we calculate the FPE defined in (2).

The Effect of Model Exploration on the Posterior Mean
This section shows in what ways model exploration can affect the BMA estimator, which is the posterior mean.Our discussion is based on the models (3) and (4).To define BMA formally, let β be the BMA estimator of β, then where µ α = E(β|α, Y), is the conditional posterior mean under model α and Pr(α|Y) is the posterior model probability of model α.
The BMA estimators are often used in Bayesian paradigm for model selection as BMA is a stable estimator since averaging over all the models takes care of model uncertainty, therefore it has better predictive performance.From ( 7), we notice that the BMA is driven by the posterior model probability; a higher value of probability results in a better estimator, whereas a smaller value presents an estimator of inferior quality.We show that the conditional posterior mean is approximately equivalent to constrained OLS estimator under our proposed models.Therefore; BMA is driven by the posterior model probability and constrained OLS estimator.In ( 5) and ( 6), we show how to calculate constrained OLS estimators.Hence if the Gibbs sampler moves around a smaller model space, it results in exploring smaller models.In addition, both the posterior model probability and constrained OLS will produce poorer BMA estimate, whereas in bigger model space the Gibbs sampler executes better.Therefore the corresponding BMA estimator calculated based on the models visited by Gibbs sampler, can be used as a tool to determine the nature of variables over the model space.In fact, in Section 6 we will use BMA estimator to compare the performance of FPE variable selection techniques.

Posterior Model Probability under Model (3)
Here we study in great detail the posterior model probability under model (3) as it is vital for model exploration.The posterior probability for α is , where Pr(Y|α) = f (Y|β, α)f (β|α)dβ is the marginal integrated likelihood of α and Pr(α) is the marginal prior probability of model α.If we interpret Pr(α|Y) as an assessment of the importance of model α, then from a model selection perspective; the goal is to choose α for which Pr(α|Y) is largest.In the data analysis section (Section 6), we will present this probability for each variables in the model.
Theorem 1.Under the models (3) and ( 4), the posterior model probability of α is , where In describing the proposed bimodal model, V is set to a very large value, but the question is how large is this value?In Section 4, we come up with the exact value of V .In describing the Bayesian hierarchical model, we have mentioned that v 0 should be very small and V should be very large; so it is interesting to note the limiting form of the posterior model probability under the limits v 0 → 0 and V → ∞.Under our proposed model, the conditional posterior mean is approximately equivalent to constrained OLS estimator.The first Lemma will formally present this argument.The second Lemma will depict the limiting form of C(α).Combining both results, we present the limiting form of Pr(α|Y) in the Theorem 2.
where μα denotes the estimator with values μ+α along the α coordinates, and zero elsewhere, and μ+α = ( Theorem 2. As v 0 → 0 and V → ∞, along with the results of Lemma 1 and Lemma 2, the posterior model probability of Theorem 1 is of the following form , where μα and Ĉ(α) are as defined earlier.(Here " ≈ " stands for "approximately equivalent to").
The final form of the posterior model probability in Theorem 2; leads to a number of interesting issues with respect to model space exploration.The presence of a constrained OLS estimator; in the expression of the posterior model probability is noticeable.We calculate the constrained OLS based on the indices of the model.With a smaller model, the constrained OLS will be calculated based on the indices of that model, which results in introducing a smaller posterior model probability.However, we do not want to restrict ourselves to only smaller models; but would like to visit a variety of models.This is not possible because of the presence of V −Kα/2 ; appearing in Pr(α|Y) (posterior model probability) through C(α).In a diffuse prior; the probability will heavily penalize larger models, which results in a penalization of the posterior values.As a result, the Gibbs sampler based on (3) only visits smaller models.

Rescaled Spike and Slab Model: A Bimodal Spike and Slab Model
To resolve this setback; instead of using (3), we use a rescaled spike and slab model (Ishwaran and Rao, 2005b).This approach uses the same prior (4) for π, however, in place of Y i in (3), we use the rescaled values Y * i = σ−1 n 1/2 Y i .An appropriate adjustment to the variance is introduced to accommodate the rescaling.A rescaled spike and slab model defined in Ishwaran and Rao (2005b) is: Note that this hierarchy includes a factor n; in the variance of the response to compensate for the rescaling.Also, any effect of σ 2 is removed from this model because the Y i 's are rescaled by σ2 .For this reason, unlike in (3), the presence of σ2 is not needed.
By replacing Y by Y * and setting V = n, the limiting posterior model probability under the limit v 0 → 0 and (Here " ≈ " stands for "approximately equivalent to").Perfect cancellation of V and σ2 occurring in Ĉ(α), ensures that Ĉ * (α) is independent of V .Hence the issue of the unavoidable penalty term involved in the posterior model probability is resolved in (3).We present some interesting results related to the posterior model probability under the bimodal model (8).The following theorem shows an interesting rela-tionship between the posterior model probability ( 9) and a BIC-like penalization value.
Theorem 3.Under (8), maximizing Pr(α|Y * ) is approximately equivalent to minimizing the BIC-penalty The above theorem shows that there is an intimate association between the posterior for α and BIC-penalization, but to assume that model selection based on our method, is nothing more than a restricted all-subsets search guided by a BIC-like penalty, is incorrect and flawed.With the induction of the effects of the "true" complexity parameter (for details, see the proof of the Theorem 4 in the Appendix), we present further interesting results using the posterior mean in model ( 8).The following theorem demonstrates our viewpoint.
The presence of w 0 in Theorem 4 leads to an interesting discussion on model exploration.The expression in ( 12) is not a traditional BIC penalty, instead a modified term involving the adaptively estimated complexity!For instance, a small value 0 < w 0 < 1/2 will heavily penalize larger models, whereas a large 1/2 < w 0 < 1 value will penalize smaller models.The value w 0 = 1/2 leads to a BIC-like penalty, since, setting w 0 = 1/2 in (12) yields (11).
In fact, one can view (12) or (13) as a compromise between BIC and an adaptively estimate of the AIC penalty.This is another way to interpret the effect of w 0 .Setting w 0 = 1/2 in (12) or (13) eliminates the AIC penalty.On the other hand, as w 0 approaches 1, the AIC penalization term becomes negative and the overall penalization drops, which encourages larger models.The opposite occurs as w 0 becomes small, as w 0 approaches 0; it results in a positive and very large penalization; yielding small models.

Example 1
A simulation study is performed to compare the performance of this model exploration tool compared to other popular model selection methods.we are using "Breiman simulations" as described in Breiman (1992).In the simulations, data is generated with the following setting: the covariates x i are simulated independently from a multivariate normal distribution with E(x i,k ) = 0 and E(x i,j x i,k ) = ρ |j−k| , where 0 < ρ < 1 represent the correlation coefficient and the error i 's are i.i.d N (0, 1) variables.Three different conditions of ρ are considered: (i) uncorrelated design (ρ = 0), (ii) moderate correlated design (ρ = 0.5), and (iii) highly correlated design (ρ = 0.9).
Under this global condition, two different conditions are considered: (I) a model with moderate number of non-zero coefficients (n = 200, K = 100, there are 55 zero coefficients); and (II) a model with many zero coefficients (n = 800, K = 400, there are 295 zero coefficients).For model (I), non-zero coefficients are clustered into 9 different groups, each cluster consists of 5 coefficients.Value of these 5 coefficients remain unchanged and replicates across the entire clusters.For model (II), non-zero coefficients are clustered into 15 different groups, each cluster consists of 7 coefficients.Value of these 7 coefficients remain unchanged and replicates across the entire clusters.In all simulated models, coefficient values are adjusted by multiplying a constant by which the theoretical R 2 is 0.75.
The goal of this simulation study is to see how well our model exploration tool performs in comparison to lasso (Tibshirani, 1996), adaptive lasso (Zou, 2006) and relaxed lasso (Meinshausen, 2007).Relaxed lasso is an alternative to lasso; where a double optimization technique is implemented using two tuning parameters to improve variable selection performance, for details see Meinshausen (2007).The adaptive lasso is another modification to lasso, works different than relaxed lasso, adds a unique penalization to each coefficient on top of lasso penalization, see Zou (2006) for details.Most of the time these methods are used for model selection; here we use these methods from the perspective of model space exploration.For this reason, we focus on a couple of items: estimated model size ( k) and total number of incorrectly classified variables (Miss).Lasso computation is implemented by using R package lars; relaxed lasso is implemented by using R package relaxo.Tables 1 and 2 summarizes the simulation results.
Below is the summary of our findings from the simulation study (with reference to Tables 1 and 2).
Bimodal model always opts for smaller model.The lasso selects larger models and has a tendency to overfit the model.In certain situations; misclassification rate for bimodal model is lesser than in lasso and other methods.In most situations; relaxed lasso performs quite identical to lasso.In correlated setting, performance of adaptive lasso is inferior than lasso as far as misclassification rate In all different conditions, for bimodal model, based on the estimated model size, the number of correctly specified non-zero variables on an average; perform reasonably well with respect to other comparable methods.
Finally, it is seen that in certain circumstances; misclassification rate is higher for bimodal model than other methods as the model size is smaller compared to other methods.Here lies the trade-off between misclassification rate and model size: if the model size is bigger, misclassification rate is low and vice versa.

Example 2
Here we conduct another simulation study to investigate the performance of the proposed bimodal model using high-dimensional sparse linear model as used in Ing and Lai (2011) and Shao and Chow (2007).Two different sets of sample size and number parameters are considered in the model: (n = 50, K = 200) and (n = 100, K = 400).For each of the above condition, error standard deviations are taken as σ = 1 and σ = 0.1.It is assumed that only the first five covariates are non-zero and rest of them are zero.
The simulated data is generated as follows.The response Y i 's are generated as x n are independently generated from a multivariate normal distribution with mean vector 1 K (the Kdimensional vector of ones) and covariance matrix I K (an identity matrix of order K).Two different constraints on x i 's are considered: x i 's are fixed throughout the iterations and x i 's are random throughout the iterations.The β's values are taken as β = (3, −3.5, 4, −2.8, 3.2, 0, 0, • • • , 0) t .The error i 's are independently generated from N (0, σ 2 ).
Table 3 reports the performance of the bimodal spike and slab model in highdimensional sparse linear model.In all different situations, bimodal model shows quite similar trend in variable selection, correctly specifying non-zero variables and also in incorrectly specifying variables.There is not much difference noticed in variations of model properties.In all the different situations, bimodal model produces smaller model size.There are only five non-zero covariates in the model; from this study it is noticeable that the correctly identified non-zero covariates are approximately two on an average, which is fairly fitting, and the misclassification rate is also reasonably effective.It should be noted that in current model/variable selection techniques for high-dimensional sparse models, most of the time a twostage procedure has been used; first a variable screening technique is used to reduce the number of variables from the model; then a second method has been used to the reduced model for final model/variable selection.Here we don't use any screening method; we directly use the bimodal model for variable selection.This section applies our R software package modelSampler for variable selection by FPE method using the well known ozone data (Breiman and Friedman, 1985).The data set consists of 203 observations with 12 variables, each observation is a day.The outcome variable in this example is ozone emission, ozone.
Table 4 contains the variable selection result based on ozone data.The first column (variable names) is ordered by the BMA estimators.The second column represents the posterior inclusion probability values of those variables.If the goal is to perform variable selection based on the median model (Barbieri and Berger, 2004) (0.50 of "prob" value is the cut-off point for selecting a variable to be in the model), then according to the median model, only four variables are in the selected model.The last two columns indicate the AIC and BIC models.The table shows that six variables are selected by AIC whereas only three variables are selected by BIC.It is interesting to note that the BIC model is more conservative than the median model.
Table 5 helps to assess the importance of a variable.Variables in the first column of the table are ordered by the BMA estimators, as a result the ordering is quite stable.This indicates the best model of size k for each k in terms of R 2 value.This table allows one to weigh in the global importance of a given variable.For example, the variable "humid" is highly significant as it appears in the best model for any given size.Also the posterior inclusion probability for "humid" is 0.98, the largest value of all the variables.This result demonstrates the behavior of the variables over the model space.

Graphical Diagnostics
Here we graphically summarize the results obtained from the analysis using a graphical wrapper which is produced from the R package (see Figure 1).The wrapper consists of five distinct plots.The top leftmost plot represents a histogram of several estimated complexity parameters from 10,000 sampled models.A closer look at the plot indicates that the mode of the complexity value lies in Figure 1: Complete graphical analysis of Ozone data.Results are based on a 2,500 burn-in iteration and 10,000 sampled values between 0.33 to 0.56.The middle plot on the top represents FPE values as a function of the dimension of the models.The black line corresponds to minimum residual sum of squares, the red line is minimum AIC and the green line is minimum BIC.The rightmost plot on the top is a histogram of several dimensionality of the models visited by the Gibbs sampler.A model of size 3 to 5 is the most frequently visited model.This characterizes the fact that three to five variables are the most significant variables in the data set.The bottom leftmost plot is an interesting plot.Here we have plotted the variables of the sampled model with respect to the number of Monte Carlo iterations.This plot helps to re-evaluate the importance of variables implicitly.For example, humid is present in almost all sampled models, while dayweek hardly ever appears.Clearly humid is an important variable.The last plot on the bottom represents the probability of visiting a new model at each iteration.We see that it stabilizes very quickly as the line is almost flat as the number of iterations approaches 10,000; which graphically identifies the convergence of the Gibbs sampler.
These plots are highly effective in explaining the data.Model space exploration helps to understand the nature of variables in explaining the data.A closer look at the plots show strong evidence of our argument that without even implementing any model selection technique; we have a clear-cut idea of the relevant variables and the data dimensionality.The leftmost plot on the bottom is a perfect example in this situation.As we sample several models, it is clear from this plot, which all variables are all-encompassing of any model.

Discussion
In this article we propose a simple bimodal model under spike and slab hierarchy.From the model space exploration point of view; this model is appropriate as it has selective shrinkage property.We have discussed the interplay between conditional posterior mean and model space exploration.Theoretical justifications illustrate the asymptotic equivalence between conditional posterior mean and constrained OLS estimator, that results in implementing frequentist model selection approach.Another interesting feature of bimodal prior is that it associates with the popular g-prior introduced by Zellner (1986), see Dey (2008) for details.By conducting an extensive simulation study; we show that the bimodal model favorably prevails against lasso, adaptive lasso and relaxed lasso.In all simulations it generated smaller models with competing misclassification rate.The performance of the proposed method has also been tested in highdimensional sparse linear model via simulation study.The bimodal model shows quite reasonable performance by producing smaller models and with acceptable misclassification rate and correctly classifying non-zero covariates in the model.The proposed method is effective in both model space exploration and variable selection process by using both frequentist and Bayesian techniques.The R package modelSampler is available upon request.Therefore, µ α → μα , as v 0 → 0 and V → ∞.

Table 1 :
Simulation result based on model (I) of Example 1.The values used in the simulation are: sample size (n) = 200, total number of variables (K) in the model = 100, out of them 55 (55%) are zero coefficients.

Table 2 :
Simulation result based on model (II) of Example 1.The values used in the simulation are: sample size (n) = 800, total number of variables (K) in the model = 400, out of them 295 (74%) are zero coefficients.Reported criteria are k (averaged estimated model size) , Correct (average number of correctly classified variables), and Miss (average number of misclassified variables).

Table 3 :
Simulation result based on Example 2. Reported criteria are k (averaged estimated model size), Correct (average number of correctly classified variables), and Miss (average number of misclassified variables).

Table 4 :
Variable selection results based on Ozone data.Results are based on a 2,500 burn-in iteration and 10,000 sampled values

Table 5 :
Top model stratified by size of the model based on Ozone data.Results are based on a 2,500 burn-in iteration and 10,000 sampled values