Selection of Smoothing Parameter for One-Step Sparse Estimates with $L_q$ Penalty

This paper discusses the selection of the smoothing parameter necessary to implement a penalized regression using a nonconcave penalty function. The proposed method can be derived from a Bayesian viewpoint, and the resultant smoothing parameter is guaranteed to satisfy the sufficient conditions for the oracle properties of a one-step estimator. The results of simulation and application to some real data sets reveal that our proposal works efficiently, especially for discrete outputs.


Introduction
A crucial problem in statistical modeling is variable selection, which affects the accuracy of inferences. This is especially important for the selection of explanatory variables in regression analysis of recent huge data sets. Many methods for variable selection in regression have been developed and summarized in standard textbooks, such as Chatterjee, Hadi and Price (2000). Cross-validation (CV), AIC and BIC are known to be convenient techniques, not only for variable selection in regression but also for general model selection. However, when the set of variables is large, these techniques have a high computational cost, since all combinations of the variables must be evaluated. The LASSO-type method can overcome this difficulty by implementing estimation and variable selection simultaneously (see Tibshirani, 1996). Progress on this simultaneous implementation includes Fan and Li (2001), in which nonconcave penalty functions play an important role; Fan and Li (2001) show that such estimators possess the oracle properties. Zou and Li (2008) discuss using local linear and quadratic approximations to avoid the singularities of nonconcave penalty functions and show that the resulting estimator possesses the oracle properties as well.
On the other hand, penalized regression approaches generally include a smoothing parameter. Because the estimates obviously depend on the actual value of the smoothing parameter, the selection of the smoothing parameter ultimately affects variable selection. This paper presents a simple selection method for the smoothing parameter used in nonconcave penalized likelihood models. Zou and Li (2008) utilized a conventional 5-fold CV to determine the smoothing parameter included in the model. However, it is not clear that one-step estimates obtained through the model with a smoothing parameter determined by 5-fold CV preserve the oracle properties. We propose a simple and effective method for the selection of the smoothing parameter and show that the obtained smoothing parameter satisfies the sufficient conditions for the oracle properties of one-step estimates. The method is developed by using an appropriate prior for the parameters of the model, an idea which has been discussed in the context of model selection in mixed models (Ruppert, Wand and Carroll, 2003; Yoshida, Kanba and Naito, 2010).
The paper is organized as follows. Section 2 gives a summary of the one-step estimator discussed in Zou and Li (2008). In Section 3, our method for selection of the smoothing parameter is proposed. Practical examples of regression models and the functions to be optimized are shown in Section 4. Section 5 reports the results of applying the proposal to real data sets as well as simulation studies. It will be seen that our proposed method is especially efficient for models with discrete output, such as binary and Poisson models. Final discussions are contained in Section 6.

One-Step Estimator
Let $\{(y_i, x_i) \mid i = 1, \ldots, n\}$ be a data set, where $x_i$ is the $p$-dimensional explanatory variable and $y_i$ is the scalar response variable. Assume that $y_i$ depends on $x_i$ through a linear combination $x_i^T\beta$ and has the density $f(y_i)$. The conditional log-likelihood given $x_i$ is assumed to be expressed as $\ell_i(\beta, \phi)$, where $\phi$ is a dispersion parameter. In the setting of the generalized linear model, $\phi$ is the variance of the error for the linear model, and we can take $\phi \equiv 1$ for the logistic and Poisson models. For simplicity, we use $\ell(\beta)$ to stand for the log-likelihood $\ell(\beta, \phi) = \sum_{i=1}^{n} \ell_i(\beta, \phi)$. Zou and Li (2008) considered the variable selection method of maximizing the penalized log-likelihood function

$$\ell(\beta) - n \sum_{j=1}^{p} p_{\lambda_n}(|\beta_j|), \tag{2.1}$$

where $p_{\lambda_n}$ is a nonconcave penalty function, such as SCAD and $L_q$, and $\lambda_n$ is the smoothing parameter. In the special case of a separable penalty function $p_{\lambda_n}(\cdot) = \lambda_n p(\cdot)$, for smoothing parameter $\lambda_n$ and some function $p(\cdot)$, we can obtain estimates of $\beta$ by maximizing

$$\ell(\beta) - n\lambda_n \sum_{j=1}^{p} p(|\beta_j|) \tag{2.2}$$

with respect to $\beta$. In fact, since maximizing (2.1) and (2.2) is challenging because of the singularity involved in $p(\cdot)$, we utilize an approximation of the penalized log-likelihood, obtained by a Taylor expansion of the log-likelihood around the MLE $\hat\beta(\mathrm{mle})$ together with the local linear approximation (LLA) of the penalty function with respect to $\beta$,

$$p(|\beta_j|) \approx p(|\hat\beta_j(\mathrm{mle})|) + p'(|\hat\beta_j(\mathrm{mle})|)\left(|\beta_j| - |\hat\beta_j(\mathrm{mle})|\right).$$

Zou and Li (2008) then defined the one-step estimator (OSE) as

$$\hat\beta(\mathrm{ose}) = \operatorname*{arg\,max}_{\beta}\left\{ \ell(\hat\beta(\mathrm{mle})) + \frac{1}{2}\left(\beta - \hat\beta(\mathrm{mle})\right)^T \nabla^2\ell\left(\hat\beta(\mathrm{mle})\right)\left(\beta - \hat\beta(\mathrm{mle})\right) - n\lambda_n \sum_{j=1}^{p} p'\left(|\hat\beta_j(\mathrm{mle})|\right)|\beta_j| \right\}. \tag{2.3}$$

Here, let $\beta_0 = (\beta_{10}^T, \beta_{20}^T)^T$ be the true parameter and let $\beta_{20} = 0$. Similarly, let $\hat\beta(\mathrm{ose}) = (\hat\beta(\mathrm{ose})_1^T, \hat\beta(\mathrm{ose})_2^T)^T$, corresponding to the division of the true parameter $\beta_0$. Also, let $I(\beta_0)$ be the Fisher information matrix and let $I_1(\beta_{10}) = I(\beta_{10}, 0)$, where $I_1(\beta_{10})$ is the submatrix of $I(\beta_0)$ corresponding to $\beta_{10}$. In the special case of a penalty function which separates into a smoothing parameter $\lambda_n$ and a function $p(\cdot)$ satisfying regularity Conditions 1 and 2 on the penalty function (see Zou and Li, 2008), together with the following condition on the smoothing parameter for the $L_q$ penalty $p(t) = |t|^q$, $0 < q < 1$:

Condition 3. $\sqrt{n}\,\lambda_n \to 0$ and $n^{(2-q)/2}\lambda_n \to \infty$ as $n \to \infty$,

the OSE has the oracle properties (Zou and Li, 2008):

(a) Sparsity: $\hat\beta(\mathrm{ose})_2 = 0$ with probability tending to one;
(b) Asymptotic normality: $\sqrt{n}\left(\hat\beta(\mathrm{ose})_1 - \beta_{10}\right) \to_d N\left(0, I_1(\beta_{10})^{-1}\right)$.

Zou and Li (2008) used 5-fold CV to determine $\lambda_n$. However, the $\lambda_n$ determined by 5-fold CV does not necessarily satisfy Condition 3, and the computational cost of implementing 5-fold CV itself is not cheap. A simple and reliable selection of the smoothing parameter $\lambda_n$ is proposed in the subsequent discussion.
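To make the computation concrete, here is a minimal numpy sketch of the one-step update, under the reconstructed form of (2.3) above: the objective is a concave quadratic plus a weighted $L_1$ term, so proximal gradient (ISTA) applies directly. The function names, the choice of ISTA and the default $q$ are our own illustrative assumptions, not the paper's.

```python
import numpy as np

def lq_weights(mle, q=0.5, eps=1e-8):
    """LLA weights p'(|mle_j|) = q * |mle_j|^(q-1) for the L_q penalty."""
    return q * np.maximum(np.abs(mle), eps) ** (q - 1.0)

def ose_quadratic(mle, H, weights, n_lam, n_iter=500):
    """Maximize (2.3): 0.5 (b - mle)' H (b - mle) - n_lam * sum_j w_j |b_j|,
    with H the (negative definite) Hessian of the log-likelihood at the MLE.
    Equivalently, minimize the negated objective by proximal gradient."""
    step = 1.0 / np.linalg.norm(H, 2)        # 1 / Lipschitz constant of gradient
    b = mle.copy()
    for _ in range(n_iter):
        grad = -H @ (b - mle)                # gradient of the smooth part
        z = b - step * grad
        thr = step * n_lam * weights
        b = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft-thresholding
    return b
```

The same routine serves any model in Section 4 once the corresponding Hessian at the MLE is supplied.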

Selection of Smoothing Parameter
In this section, assuming $p_{\lambda_n}(\cdot) = \lambda_n p(\cdot)$, we show that $\lambda_n$ can be expressed in terms of the parameter of a prior distribution for $\beta$.

Decision of Smoothing Parameter
Assume that the prior distribution of $\beta$ is

$$\pi(\beta \mid \theta) = \frac{1}{d(\theta)} \exp\left\{ \frac{1}{\theta} \sum_{j=1}^{p} T(\beta_j) \right\},$$

where $\theta$ is a scalar parameter, $d(\theta)$ is the normalizing constant and $T(\cdot)$ is a measurable function. Then the log of the joint density of $(y, \beta)$ is

$$\sum_{i=1}^{n} \log f(y_i) + \frac{1}{\theta} \sum_{j=1}^{p} T(\beta_j) - \log d(\theta),$$

where $f(y_i)$ is the density function of $y_i$ given $\beta$. We note that maximizing this formula with $\theta = 1/(n\lambda_n)$ and $T(\beta_j) = -p(|\beta_j|)$ is the same as maximizing the penalized likelihood (2.2). Hence, using the parameter $\theta$, the smoothing parameter $\lambda_n$ can be expressed as

$$\lambda_n = \frac{1}{n\theta}. \tag{3.1}$$
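Written out, the identification behind (3.1) is a short grouping of terms; with $T(\beta_j) = -p(|\beta_j|)$, and ignoring $\log d(\theta)$, which does not involve $\beta$,

$$\sum_{i=1}^{n} \log f(y_i) + \frac{1}{\theta}\sum_{j=1}^{p} T(\beta_j) = \ell(\beta) - \frac{1}{\theta}\sum_{j=1}^{p} p(|\beta_j|),$$

which coincides with the objective (2.2) exactly when $1/\theta = n\lambda_n$, i.e., $\lambda_n = 1/(n\theta)$.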

Conditions for Smoothing Parameter and Penalty Function
Here, we show that the OSE with the smoothing parameter (3.1) satisfies the oracle properties. If Conditions 1-3 hold, then the OSE has the oracle properties. Conditions 1 and 2 are concerned with the penalty function, and Condition 3 is for the smoothing parameter. Hence, we verify whether the smoothing parameter (3.1) satisfies Condition 3.
First, suppose $\theta$ is a constant. In this case $\lambda_n = 1/(n\theta) = O(n^{-1})$, and for $0 < q < 1$ the smoothing parameter (3.1) with a constant $\theta$ does not satisfy Condition 3.
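To see this concretely, under the form of Condition 3 given in Section 2, with $\lambda_n = 1/(n\theta)$ and a constant $\theta$,

$$\sqrt{n}\,\lambda_n = \frac{1}{\sqrt{n}\,\theta} \to 0, \qquad n^{(2-q)/2}\,\lambda_n = \frac{n^{-q/2}}{\theta} \to 0 \quad \text{for } 0 < q < 1,$$

so the first requirement holds but the second fails. A $\theta$ that decreases with $n$ fast enough that $n^{-q/2}/\theta \to \infty$, i.e., $\theta = o(n^{-q/2})$, is therefore needed.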

Examples
In this section, we show some models which are included in our framework.

Linear Regression Model
For the linear model $y = X\beta + \varepsilon$, the Hessian matrix is

$$\nabla^2\ell(\beta) = -\frac{1}{\phi} X^T X.$$

Hence, we can get the OSE by optimizing

$$-\frac{1}{2\phi}\left(\beta - \hat\beta(\mathrm{mle})\right)^T X^T X \left(\beta - \hat\beta(\mathrm{mle})\right) - n\lambda_n \sum_{j=1}^{p} p'\left(|\hat\beta_j(\mathrm{mle})|\right)|\beta_j|.$$

We use the MLE for $\phi$.
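For the linear model specifically, the weighted $L_1$ problem reduces to an ordinary Lasso after rescaling each column by its LLA weight, so an off-the-shelf solver can be used. The sketch below is ours, under that reduction ($\phi$ absorbed into the tuning constant) and with an assumed toy design:

```python
import numpy as np
from sklearn.linear_model import Lasso

def ose_linear(X, y, lam, q=0.5, eps=1e-8):
    """One-step L_q estimate for the linear model.  Because the Hessian is
    -X'X/phi, maximizing (2.3) is equivalent to a weighted Lasso, which a
    column rescaling turns into an ordinary Lasso."""
    mle, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS = unpenalized MLE
    w = q * np.maximum(np.abs(mle), eps) ** (q - 1.0)  # LLA weights p'(|mle_j|)
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X / w, y)
    return fit.coef_ / w                               # back to the beta scale

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])  # assumed sparse truth
y = X @ beta + rng.normal(size=100)
print(np.round(ose_linear(X, y, lam=0.1), 2))
```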

Logistic Model
For a data set $\{(y_i, x_i) \mid i = 1, \ldots, n\}$ with binary response $y_i \in \{0, 1\}$, the log-likelihood is

$$\ell(\beta) = \sum_{i=1}^{n}\left\{ y_i x_i^T\beta - \log\left(1 + \exp(x_i^T\beta)\right) \right\},$$

where $\pi(x_i) = \exp(x_i^T\beta)/\{1 + \exp(x_i^T\beta)\}$. The Hessian matrix is

$$\nabla^2\ell(\beta) = -\sum_{i=1}^{n} \pi(x_i)\left(1 - \pi(x_i)\right) x_i x_i^T.$$

Hence, we can get the OSE by optimizing (2.3) with this Hessian evaluated at $\hat\beta(\mathrm{mle})$.

Poisson Model
For a data set $\{(y_i, x_i) \mid i = 1, \ldots, n\}$ with count response $y_i$, the log-likelihood is

$$\ell(\beta) = \sum_{i=1}^{n}\left\{ y_i x_i^T\beta - \pi(x_i) - \log(y_i!) \right\},$$

where $\pi(x_i) = \exp(x_i^T\beta)$. Since the Hessian matrix is

$$\nabla^2\ell(\beta) = -\sum_{i=1}^{n} \pi(x_i)\, x_i x_i^T,$$

we can get the OSE by optimizing (2.3) with this Hessian evaluated at $\hat\beta(\mathrm{mle})$.
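The logistic and Poisson Hessians above differ only in the diagonal weight, so a single helper can produce the curvature input for (2.3) for either model. The interface below is our own sketch (the MLE itself could come from, e.g., iteratively reweighted least squares):

```python
import numpy as np

def glm_hessian(X, beta, family):
    """Hessian of the log-likelihood at beta: -X' W X, where
    W = diag(pi (1 - pi)) for the logistic model and W = diag(pi),
    with pi = pi(x_i) as defined in the text."""
    eta = X @ beta
    if family == "logistic":
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = pi * (1.0 - pi)
    elif family == "poisson":
        w = np.exp(eta)                      # pi(x_i) = exp(x_i' beta)
    else:
        raise ValueError(f"unknown family: {family}")
    return -(X * w[:, None]).T @ X           # = -X' diag(w) X
```

Passing this matrix, the model's MLE and the $L_q$ LLA weights to the `ose_quadratic` sketch in Section 2 yields the OSE for the logistic or Poisson model.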

Implementation
In this section, we report the results of applying the proposed OSE to real data sets. Simulation results are addressed as well. We also compare the models obtained by AIC, BIC, LASSO, OSE(CV) and our proposed OSE (denoted EST1 and EST2 in the tables), where OSE(CV) refers to an OSE with $\lambda_n$ chosen by minimizing the 5-fold CV value, with $q = 0.01$.

Real Data Sets
Here, we performed a comparison of estimation methods using the Ozone, Diabetes, Parkinson, Glass and Wine data sets. We used the 5-fold CV value to compare the prediction ability (PA) of each method; a smaller value indicates better prediction. The variables selected by each method are also compared. All results are tabulated in Table 1.
For the Ozone data set, we applied a linear regression model with quadratic and interaction terms,

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}^2 + \beta_5 x_{i2}^2 + \beta_6 x_{i3}^2 + \beta_7 x_{i1}x_{i2} + \beta_8 x_{i1}x_{i3} + \beta_9 x_{i2}x_{i3} + \varepsilon_i,$$

where $(x_{i1}, x_{i2}, x_{i3}) = (\text{solar radiation}_i, \text{daily maximum temperature}_i, \text{wind speed}_i)$ for $i = 1, \ldots, 111$. We can observe from Table 1 that the PAs of EST1, EST2, OSE(CV) and LASSO are larger than those of AIC and BIC. The 7th variable, solar radiation × daily maximum temperature, should be the most important in this model, since all methods selected it. All methods here except AIC and BIC selected the 5th variable, squared daily maximum temperature. This inclusion of the 5th variable might be the reason their PAs are worse. For the Diabetes data set, the PAs of AIC, BIC and OSE(CV) are smaller than those of the other methods. EST1, EST2 and LASSO performed similarly with respect to PA. It is interesting that there is a significant difference between the PA of OSE(CV) and that of LASSO, although they selected the same two variables. This might be the effect of the choice of $q$: for LASSO, $q = 1$, while for OSE(CV), $q = 0.01$.
For the Parkinson data set, all methods have similar PAs. The proposed OSE selected no variables, although the other methods selected several variables. This means that EST1 and EST2 both suggest a "Random Guess" in this model. This might be an extreme solution, but it motivates us to look at the set of explanatory variables carefully.

For the Glass data set, the PA of EST2 is the smallest, and EST2 selected only 3 of the 9 explanatory variables. This means that EST2 achieves precise prediction using only 3 variables.

For the Wine data set, there is little difference among the PAs, except for EST1. BIC selected just two variables and has the best PA. EST1 and EST2 selected only a few variables, although OSE(CV) and LASSO selected all variables. From the sets of selected variables and the PAs, one can observe that the behavior of EST2 is similar to that of AIC in the case of the Poisson model.

For the Parkinson, Glass and Wine data sets, there is little difference among the PAs of the methods. On the other hand, the PAs of AIC and BIC are better than those of the other methods for the Ozone and Diabetes data sets, which use a linear model. As expected, AIC and BIC performed well with linear models. Note that the sets of variables selected by EST1 and EST2 were similar to those selected by AIC and BIC. For logistic and Poisson regression models with discrete output, there was little difference between the PAs of the methods, but EST1 and EST2 selected only a relatively small number of variables. Hence it can be claimed that the proposed OSE is efficient for models with discrete output.

The proposed method does not require a grid search for the smoothing parameter, and it can select variables at the same time as it estimates $\beta$. Hence, the proposed method can be executed at little computational cost compared to AIC and BIC, which are computationally costly because they require a fit for every combination of the variables. This is an attractive point for the use of the proposed OSE.

Simulation
Here, to see the behavior of the proposed OSE, we carried out simulations. First, we explain the structure of the simulations.
We utilized six regression models: linear (Models 1 and 2), logistic (Models 3 and 4) and Poisson (Models 5 and 6). Covariates are generated according to $x_i \sim N(0, \Sigma)$, where the $(i, j)$-component of $\Sigma$ is $0.5^{|i-j|}$, and errors according to $\varepsilon_i \sim N(0, 1)$. Here we use the model error

$$\mathrm{ME}(\hat\mu) = E\left[\left(\hat\mu(x) - \mu(x)\right)^2\right]$$

for the constructed prediction $\hat\mu(\cdot)$ based on the variable selection, where the expectation is taken with respect to a new observation $x$. The simulation algorithm is as follows (a sketch of the loop appears after the list):

1. Generate a data set $\{(y_i, x_i) \mid i = 1, \ldots, n\}$ from the model.

2. Apply each method to the generated data to estimate $\beta$.

3. For the estimates of $\beta$ obtained by each method, calculate $\hat\mu(\cdot)$.

4. Evaluate the prediction ability (PA) by the 5-fold CV value and calculate the ratio of model error (RME).

5. Calculate Correct, the number of nonzero coefficients correctly estimated to be nonzero, and Incorrect, the number of zero coefficients incorrectly estimated to be nonzero.

6. Repeat steps 1 through 5 a thousand times, and then calculate the MRME (the median of RME), PA (the average of the 5-fold CV), C (the average of Correct) and IC (the average of Incorrect).
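A minimal sketch of this loop for a linear model is given below. The coefficient vector `beta0` is an assumed stand-in (the actual Models 1-6 are specified in the paper's tables, not reproduced in this extract), and for a linear model with $x \sim N(0, \Sigma)$ the model error has the closed form $\mathrm{ME}(\hat\beta) = (\hat\beta - \beta_0)^T \Sigma (\hat\beta - \beta_0)$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 8
beta0 = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])  # assumed truth
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Lc = np.linalg.cholesky(Sigma)

def one_run(n, fit):                            # `fit` is any estimator
    X = rng.normal(size=(n, p)) @ Lc.T          # x_i ~ N(0, Sigma)
    y = X @ beta0 + rng.normal(size=n)          # eps_i ~ N(0, 1)
    b = fit(X, y)
    me = (b - beta0) @ Sigma @ (b - beta0)      # model error (linear case)
    correct = np.sum((b != 0) & (beta0 != 0))   # true positives
    incorrect = np.sum((b != 0) & (beta0 == 0)) # false positives
    return me, correct, incorrect

# Example: 1000 replications with a placeholder OLS fit (no selection)
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
results = np.array([one_run(100, ols) for _ in range(1000)])
print("median ME:", np.median(results[:, 0]),
      "C:", results[:, 1].mean(), "IC:", results[:, 2].mean())
```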
For Model 1, C = 2 means the method can select the two important nonzero variables, and IC = 0 means the method can delete all unnecessary variables. Tables 2 and 3 show MRME, PA, C and IC for the linear regression model. Similarly, Table 4 shows these values for the logistic regression model, and Tables 5 and 6 show these values for the Poisson regression model. We compare these values for the different methods.
In Tables 2 and 3, both the MRME and PA of BIC are the smallest of all the methods for both Models 1 and 2. The MRMEs of the proposed OSEs are larger than those of the other methods, and their PAs are also somewhat large. We can observe from the C column of Tables 2 and 3 that all methods could select the important nonzero variables for both Models 1 and 2, because C for all methods was near to 2 for Model 1 and equal to 3 for Model 2. The values of IC in Tables 2 and 3 show that the proposed OSEs could frequently delete the unnecessary variables, since the ICs of the proposed OSEs are near to 0, while the ICs of the other methods are slightly larger. The overall impression from Tables 2 and 3 is that the proposed OSE could select important variables as well as the other methods, and could delete unnecessary variables more often than the other methods. Hence it might be claimed that the proposed OSE is accurate in the sense of avoiding false positives; see Bühlmann and Meier (2008). Though its ability to predict and to fit data is not satisfactory, the proposed OSEs performed well at variable selection.

As shown in Table 4, there is little difference in MRME among the methods for both Model 3 and Model 4. There was also little difference in PA. Hence, the fit and PA of the methods were nearly the same. According to C and IC in Table 4, the proposed OSE with $\theta_2(n, \alpha_1(0.49))$ and $\theta_2(n, \alpha_2(0.99))$ selected no variables. However, according to C of Model 3, the proposed OSE with $\theta_2(n, \alpha_1(0.01))$ and $\theta_2(n, \alpha_2(0.01))$ could select the two important variables, similar to AIC and BIC, but BIC would appear to attain the best balance of C and IC values. For Model 4, only the C of LASSO is near 3, so the other methods could not select the three important variables. The values of C of the proposed OSE with $\theta_2(n, \alpha_1(0.01))$ and $\theta_2(n, \alpha_2(0.01))$, OSE(CV), AIC and BIC are all similar, BIC having the smallest IC. For Model 4, the values of C of the proposed OSE are similar to those of OSE(CV), and the values of IC for the proposed OSE are smaller than that of OSE(CV). Hence we can conclude that the proposed OSE performs better than OSE(CV) in the models simulated.

According to Tables 5 and 6, the MRMEs of the proposed OSE are comparatively small for both Model 5 and Model 6. The MRME of the proposed OSE with $\theta_2(n, \alpha_1(0.25))$ is the smallest for Model 5, and the MRME of LASSO is the smallest for Model 6. Also, there is little difference in the PAs of the methods for Model 5. For Model 6, the PAs of the proposed OSE and OSE(CV) are relatively small. The PAs of LASSO, AIC and BIC are larger than those of the proposed OSE, which means that the proposed OSE has a good PA. For Model 5, the Cs of the proposed OSE, OSE(CV), AIC and BIC are near to 2, and the ICs of the proposed OSE and BIC are near to 0. The proposed OSE with $\theta_2(n, \alpha_1(0.25))$ seems to be the best for both $n = 100$ and $n = 200$, in that it has the best balance of C and IC. We can see that AIC and BIC could not select the important variables well for Model 6, because while their ICs are near to 0, so are their Cs. For the Poisson regression model, the proposed OSE has a good fit and PA, and it can select the set of important variables well.

The results of the simulations look similar to those from applying the methods to real data sets. For linear regression models, the PAs of AIC and BIC are better than those of the other methods. For logistic and Poisson regression models, the PAs of all methods are almost the same, but the proposed OSE performed variable selection better than the other methods.

Discussion
In Section 3.1 we interpreted the penalized log-likelihood with an $L_q$ penalty as MAP estimation under the prior of $\beta$ given there. How does the form of the prior of $\beta$ affect the selection of the smoothing parameter? To examine this, we focus on the variance matrix of the prior of $\beta$, calculated as

$$\mathrm{Var}(\beta) = C_q(\theta)\, I_p, \qquad C_q(\theta) = \frac{\Gamma(3/q)}{\Gamma(1/q)}\,\theta^{2/q},$$

where $\Gamma(\cdot)$ is the gamma function and $I_p$ is the $p \times p$ identity matrix. If the variance component $C_q(\theta)$ is small, then the prior of $\beta$ is tightly distributed around the mean 0; hence $\beta \approx 0$, and the probability that many variables will be deleted from the model can be expected to be high. In this sense, the proposed method may tend to be strict with respect to selecting variables.
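As a quick numeric illustration of the variance component reconstructed above (the values of $\theta$ and $q$ below are arbitrary, chosen only to show how fast the prior variance shrinks as $\theta$ decreases when $q$ is small):

```python
import math

def C_q(theta, q):
    """Prior variance component: Gamma(3/q)/Gamma(1/q) * theta^(2/q)."""
    return math.gamma(3.0 / q) / math.gamma(1.0 / q) * theta ** (2.0 / q)

for theta in (0.5, 0.1, 0.05):
    print(f"theta={theta}: C_q={C_q(theta, q=0.5):.3e}")
# theta=0.5:  C_q=7.500e+00
# theta=0.1:  C_q=1.200e-02
# theta=0.05: C_q=7.500e-04
```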
We discuss here the effect of the variance of the prior and the smoothing parameter on variable selection. Roughly speaking, the sum of the values of C and IC in Tables 2-6 is the number of variables selected by each method. In the simulation results of the Poisson regression for Model 5, tabulated in Tables 5 and 6, the number of variables (the sum of C and IC) selected by the OSE with either $\theta_1(200)$ $(q = 0.01)$ or $\theta_2(200, \alpha_1(0.25))$ is smaller than with $n = 100$. On the other hand, the number of variables selected by the OSE with $\theta_2(200, \alpha_1(0.49))$, $\theta_2(200, \alpha_2(0.5))$ or $\theta_2(200, \alpha_2(0.99))$ is slightly larger than with $n = 100$.
We can understand these different results by considering the variance component of the prior distribution. For $\theta_1(n)$ $(q = 0.01)$ and $\theta_2(n, \alpha_1(0.25))$, the variance components decrease drastically as $n$ grows from $n = 100$ to $n = 200$; this change of variance has a more serious effect than the decrease of the smoothing parameter, hence the number of selected variables decreases.
In this sense, variable selection is affected not only by the value of the smoothing parameter but also by the variance of the prior distribution of $\beta$.
In the case of discrete output (binary and Poisson), variable selection by ordinary methods, such as AIC, BIC and LASSO, was not satisfactory according to our simulation. This is because discrete output can generally be explained by a smaller number of explanatory variables than continuous output. AIC-type methods are aimed at prediction, which means that these methods tend to include many variables in the model. A set of variables selected by AIC might include variables that are redundant for prediction, which explains the results seen in Tables 5 and 6. Bühlmann and Meier (2008) and Zhang (2008) proposed new methods of variable selection for high-dimension, low-sample-size (HDLSS) problems where $p \gg n$, called the MSA-LASSO method and the MC+ method, respectively. MSA-LASSO can be regarded as an adaptive way to search for the best initial estimator in the one-step paradigm. These authors pointed out that the number of false positives is perhaps more important than reducing prediction errors in high-dimensional data analysis. As a method with a low probability of false positives, the proposed OSE in this paper would be worth applying in the HDLSS setting.

Table 1:
Data sets and results of comparison

Table 3:
Linear regression model (n = 200)