Validation of Stepwise-Based Procedure in GAMLSS

One of the key features in regression models consists in selecting appropriate characteristics that explain the behavior of the response variable, in which stepwise-based procedures occupy a prominent position. In this paper we performed several simulation studies to investigate whether a speciﬁc stepwise-based approach, namely Strategy A, properly selects authentic variables into the generalized additive models for location, scale and shape framework, considering Gaussian, zero inﬂated Poisson and Weibull distributions. Continuous (with linear and nonlinear relationships) and categorical explanatory variables are considered and they are selected through some goodness-of-ﬁt statistics. Overall, we conclude that the Strategy A greatly performed.


Introduction
In the past decades, statistical regression models have been greatly improved with the development of extremely sophisticated models in order to deal with an increasing amount of complex datasets. Back in the days, Sir Francis Galton introduced the concept of regression toward the mean with his experiments on the size of the seeds of successive generations of sweet peas (Stanton, 2001). Since then, some well-known extensions based on the same concept were developed, such as the generalized linear models (Nelder and Wedderburn, 1972) and the generalized additive models (Hastie and Tibshirani, 1990).
However, depending on the complexity of the data in study, we may have to consider more flexible models that are able to explain not only the mean of the response (target) variable distribution, but in fact all of its parameters, i.e. a beyond mean regression model (Kneib, 2013). In this sense, the generalized additive models for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos, 2005) seem to be a great alternative.
The GAMLSS framework involves a distribution for the response variable (that does not necessarily belong to the exponential family)  and may involve parametric linear and/or nonparametric smoothing terms when modelling any or all of the parameters of the distribution as functions of the explanatory variables . Within GAMLSS, any distribution parameter can be modeled as a function of explanatory variables and hence different regression structures might be selected for each of them. Alternative ap-proaches, including criterion-based, regularization and dimension-reduction validation methods are discussed in Stasinopoulos et al. (2017) and available in gamlss package (Stasinopoulos and Rigby, 2007) in R software (R Core Team, 2020).
The main used method to select the explanatory variables in each of the regression structures in the GAMLSS framework is a stepwise-based procedure performed in each of distribution parameters called Strategy A . Recent examples of its application can be seen in Ayuso et al. (2020), Righetto et al. (2019), De Bastiani et al. (2018), Ramires et al. (2018), Leroy et al. (2016), among others. Nonetheless, no formal studies regarding this methodology are presented in the literature apart from two quite simple and specific studies: considering one predictor only (Voncken et al., 2019) and in a specific distribution on the unit interval . Hence, the aim of this paper is to study the behavior of the Strategy A procedure within the gamlss package and validate it through a set of simulation studies, considering different response variable distributions (Gaussian, zero inflated Poisson and Weibull), structures (linear and nonlinear relationships between a parameter and explanatory variables) and sample sizes. It is noteworthy that these distributions were considered since they are commonly applied in a wide range of problems. Nonetheless, more complex distributions, characteristics and behaviors might be considered in future papers. This paper is organized as follows. In Section 2, we present the GAMLSS framework and the adopted strategy regarding model selection. All simulation studies are presented in Section 3. Finally, Section 4 ends the paper with some concluding remarks.

GAMLSS Framework
Consider Y a univariate response variable which follows a specific distribution D(y; θ), where θ is its vector of parameters (e.g. for a four parameter distribution θ = (μ, σ , ν, τ ) ). Mathematically, a GAMLSS can be written as where g k (θ k ) is a link function usually determined by the range of θ k (for further details, please check De Stasinopoulos and Rigby, 2007), X k is a design matrix, β k is the parameter vector associated to X k and each s jk function is a smooth nonparametric function (e.g. a P-spline. See Eilers and Marx, 1996;Eilers et al., 2015) of an explanatory variable x jk . If, for k = 1, . . . , p, J k = 0, i.e. if no smooth functions are fitted in model (1), then we have the fully parametric GAMLSS given by In order to estimate β k and also the parameters associated to the second term (say γ jk ) in model (1), we shall fix the smoothing hyperparameters λ and maximize the penalized loglikelihood function. An example for a four-parameter distribution model is given by where P jk is a symmetric matrix that may depend on a vector of smoothing parameters (Stasinopoulos and Rigby, 2007). The smoothing hyperparameters λ can be estimated through the penalized quasi likelihood method (Lee et al., 2006) and is implemented in the pb() function in the gamlss package (for further details, please check Rigby and Stasinopoulos, 2014). If the fully parametric GAMLSS is being considered, then we only have to maximize the first term in (3). Further, for survival data, which is characterized by censored data, the response variable of n-independent observations is defined by y i = min{log(t i ), log(c i )}, where t i and c i represent the times of failure or censure, respectively. Let δ i = 1 and δ i = 0 the indicator variable for uncensored and censored observations, respectively. Considering non-informative censoring, i.e., lifetimes and censoring times independent, we maximize (3) replacing its first term by The numerical maximization of (3) can be achieved in the gamlss (or gamlss.cens for survival data) package in R using a generalization of the Cole and Green (CG) algorithm (Cole and Green, 1992), the Rigby and Stasinopoulos (RS) algorithm (Rigby and Stasinopoulos, 2005) or even through a combination of both. Both methods are well described in Rigby and Stasinopoulos (2005), Stasinopoulos and Rigby (2007) and Stasinopoulos et al. (2017), in which the main difference is that RS algorithm does not use the cross derivatives of the log-likelihood. In this paper we are using only the RS algorithm (default in gamlss package) which is generally more stable and, in most cases, faster than CG .

Selecting Explanatory Variables
According to , there are currently 13 different functions (methods) in the gamlss package that may assist us to select different subsets of explanatory variables for each of the parameters of a given response variable distribution (boosting is a different approach from these techniques and is discussed in Mayr et al., 2012;Hofner et al., 2016).
As mentioned in Section 1, the most used methodology to select covariates in GAMLSS is the Strategy A, a stepwise-based procedure, that can be accessed in the gamlss package through the stepGAICAll.A() function. For a four parameter distribution (i.e. θ = (μ, σ , ν, τ ) ), the steps of this approach are described in Nakamura et al. (2017) and Stasinopoulos et al. (2017) as follow: 1. Use a forward selection procedure to select an appropriate model for μ, with σ , ν and τ fitted as constants. 2. Given the model for μ obtained in step 1 and for ν and τ fitted as constants, use a forward selection procedure to select an appropriate model for σ . 3. Given the models for μ and σ obtained in steps 1 and 2 respectively and with τ fitted as constant, use a forward selection procedure to select an appropriate model for ν. 4. Given the models for μ, σ and ν obtained in steps 1, 2 and 3 respectively, use a forward selection procedure to select an appropriate model for τ . 5. Given the models for μ, σ and τ obtained in steps 1, 2 and 4 respectively, use a backward selection procedure to select appropriate model for ν, 6. Given the models for μ, ν and τ obtained in steps 1, 5 and 4 respectively, use a backward selection procedure to select appropriate model for σ . 7. Given the models for σ , ν and τ obtained in steps 6, 5 and 4 respectively, use a backward selection procedure to select an appropriate model for μ and then stop. By the end of these steps, the final model may contain different subsets of covariates for μ, σ , ν and τ . The criterion used to add (or remove) a variable in each regression structure is based on the generalized Akaike information criterion (GAIC; Voudouris et al., 2012), which is given by GAI C(κ) = −2l(θ) + κ × df , wherel(θ) is the fitted log-likelihood (3), κ is the degree of penalty and df are the effective degrees of freedom of the fitted model. If κ = 2 or κ = log(n), GAIC reduces to the Akaike information criterion (AIC; Akaike, 1974) and Bayesian information criterion (BIC; Schwarz, 1978), respectively.
Looking specifically to the GAIC definition, we may note that the selection of a given variable will depend on how much such variable increases the log-likelihood functionl(θ ) (consequently, decreasing the GAI C(κ) value), weighted by the increasing on the number of degrees of freedom.
In this paper we will consider both AIC and BIC criteria in order to select the best fitted models. Furthermore, we will also consider in our simulation studies a third penalty which is given by the average of AIC and BIC penalties, i.e., κ = [2 + log(n)]/2, denoted here as GAICav. This additional penalty is considered here since, as described in Hossain et al. (2016), AIC and BIC can lead to overfitting (undersmoothing) and underfitting (oversmoothing), respectively.

Simulation Studies
In this section we perform all simulation studies regarding the performance of the Strategy A approach to select explanatory variables for different regression structures in the GAMLSS framework. Three different response types that claim varied distributions with particular characteristics will be considered here (for further details regarding each of these distributions, please check Rigby et al., 2019): • Continuous response: Gaussian distribution, i.e. Y ∼ N(μ, σ ), where −∞ < y < ∞, −∞ < μ < ∞ is the mean and σ > 0 the standard deviation parameter; • Zero inflated discrete response: zero inflated Poisson (ZIP) distribution, i.e. Y ∼ ZI P (μ, σ ), where y = {0, 1, . . .}, μ > 0 is the mean of the Poisson component and 0 < σ < 1 is the exactly probability of Y = 0; • Censored response: Weibull distribution, i.e. Y ∼ W ei(μ, σ ), where Y > 0, μ > 0 is the mean and σ > 0 represents a scale parameter. All link functions were chosen based on the range of each parameter Stasinopoulos et al., 2017). When θ k is defined in the real support, the identity link function was used, when θ k is positive the logarithm function was adopted and, finally, for θ k on the unit interval, the logit link function was applied. It is noteworthy that in all scenarios we are using the default initial values in stepGAICAll.A() function in the gamlss package for the response variable distribution parameter vectors. As stated in Stasinopoulos et al. (2017), although we must use initial values for the distribution parameter vectors, the estimation algorithm is stable and fast using simple starting values (e.g. constants) for the parameter vectors. Nonetheless, users can easily set any different values as needed (Stasinopoulos and Rigby, 2007;Stasinopoulos et al., 2017).
Based on the three above mentioned distributions, we will consider two different main scenarios, one considering a linear structure (the simulated data consider only linear relationships between covariates and each of the distribution parameters) and one with both linear and nonlinear structures (nonlinear relationships are also considered). The sample sizes are generated by taking n = 150 and n = 300 and, for each scenario, all results are obtained from 1,000 Monte Carlo replications.
In all scenarios, eight different covariates will be used in the model selection process. • Binary variable: Furthermore, different coefficient values associated to x 5 and x 6 were considered in μ and σ structures, respectively, while holding all other coefficients as constants. The same above mentioned data generation process was conducted.
The variables X 4 and X 8 will be considered as noise variables, i.e., although they are included in the model selection procedure, they will not be considered in the data-generating process. Moreover, different slope values (i.e. coefficient magnitude) are evaluated in the simulation study, as will be further highlighted. For each replication, the Strategy A method selects the model parameters, then, at the end of all replications, the percentage of correct/incorrect specification in each distribution parameter is calculated. A summary of the returned p-values for the selected variables in each scenario, considering the different used penalties, are displayed in Appendices A and B. It is noteworthy here that many recent works highlight the danger of using naive p-values after the model selection stage (Lee et al., 2016), however we are providing these values only to get a sense regarding the significant parameters.

Linear Structure
In the scenarios where only linear relationships are allowed, we have a discrete (X 3 ) and a continuous (X 7 ) variable affecting both μ and σ parameters simultaneously (Table 1). Moreover, variables X 1 and X 5 will only be considered in the generating process for μ, and variables X 2 and X 6 for parameter σ .
Normal Data Let us consider the random variable Y ∼ N(μ, σ ). Here we consider the following regression structures for the two parameters of the normal distribution μ = β 01 + β 11 x 1 + β 31 x 3 + β 51 x 5 + β 71 x 7 and σ = exp[β 02 + β 22 x 2 + β 32 x 3 + β 62 x 6 + β 72 x 7 ], where the true parameter values in the data-generating processes are μ = 40 + 5x 1 − 3x 3 + 2.5x 5 − 3x 7 and Note that this model (when both parameters are affected by explanatory variables) is also known in the literature as the heteroscedastic normal regression model. Table 1: Continuous (cont) and discrete authentic variables considered on the data generating process (linear structure).

Parameter
x  We may conclude that, when compared to other variables, x 5 and x 7 presented the worst results for the mean μ. In its worst scenario (using BIC for n = 150 as can be seen in Panel (c)) the correct selection percentage of their selection were approximately 20% for x 5 and 27.5% for x 7 . One possible explanation for this, as highlighted at the end of Section 2.1, is that the effect of continuous covariates in the log-likelihood function is smaller than the one exerted by a categorical variable (both with the same range in this example). Moreover, x 5 has the smallest coefficient.
A more serious problem arose when the procedure selected the set of explanatory variables for the standard deviation parameter σ , where x 6 and (especially) x 7 were not selected very often. Again, this problem may be explained by the effects caused by these coefficients associated to these variables (this statement is true only if both continuous and categorical variables are within the same range and their associated coefficients are similar).
Regarding Figure 1(d), we may conclude that, greater values of slope imply a greater correct selection rate. Moreover, as expected, for a greater sample size, Strategy A will perform better, i.e., it will correctly select the given variable. Also, as we increase penalty κ, more rigorous is the selection of these variables.
Analyzing the p-values for the selected variables for different settings (see Supplementary Material), we may conclude that BIC criterion results in the smallest p-values, followed by GAICav and AIC criteria. As expected, when the sample size increases, the variables, when selected, become more significant. Looking at x 6 and x 7 , for sigma parameter, we may note that, even for the smallest penalty (AIC), in more than 50% of the times they were correctly selected, they were not significant at a 5% level, due to their low coefficient values and also because they are continuous variables.

Count Data
Let us consider the random variable Y ∼ ZI P (μ, σ ). The regression structures used to generate the data were μ = exp[0.4 + 0.4x 1 + 0.04x 3 + 0.04x 5 + 0.05x 7 ] and σ = logistic[−2.11 + 0.75x 2 + 1.85x 3 + 0.50x 6 + 0.63x 7 ]. Figure 2 shows the percentage of selected covariates in each of the scenarios for the ZIP model. Panels (a, b, c) show that in all cases, the percentage of authentic variables that were in fact selected for the mean of the Poisson component μ (i.e., it should be included in the model for μ) using the Strategy A procedure is relatively low, except for x 1 . However, this does not mean that there is a problem in the considered model, but actually this might be explained by the low coefficient value associated to x 3 (0.04) compared to the coefficient associated to x 1 (0.4) and, as explained in the previous scenario (normal data), the effect of continuous covariates (x 5 and x 7 ) since they are in the same range of the authentic discrete variables. Further, Panel (d) displays the correct selection rate for different values of slope, considering all other variables as fixed.
Regarding the probability of zero σ , all scenarios present a similar behavior verified in the normal data case, since the continuous variables (x 6 and x 7 ) were poorly selected, especially when BIC criterion was considered. Noise variables were selected very few times in the simulation study. In the worst scenario (considering AIC value and n = 150 as the selection criterion), noise variables were selected less than 20% of the times. All p-values related to this case are presented in Supplementary Material.
Results of this simulation study are presented in Figure 3. The percentage of authentic explanatory variables selected to compose the regression structure for the mean parameter μ and σ present the same behavior of the previous cases, i.e. high selection rate for discrete variables and low percentage for continuous ones. Also, as expected, AIC criterion presents a greater correct selection rate (Panel (d)). Finally, the returned p-values of these scenarios are presented in Supplementary Material.

Linear and Nonlinear Structures
In the scenarios where both linear and nonlinear relationships are considered, we have a discrete variable (X 3 ) affecting both μ and σ parameters simultaneously (Table 2). Variables X 1 , X 5 and X 7 will only be considered in the generating process for μ, and variables X 2 and X 6 for parameter σ . Moreover, X 7 will be generated in such a way that its effect in μ has an increasingdecreasing-increasing shape.
Normal Data Let us consider once again that Y ∼ N(μ, σ ). Now, assuming that X 7 has a nonlinear effect in μ, we will consider the following regression structures for the two parameters Table 2: Continuous (cont) and discrete authentic variables considered on the data generating process (linear and nonlinear structures).
As we can see in Figure 4, the correct selection rate for both parameters (μ and σ ), including different values of slope (Panel (d)), of the heteroscedastic normal regression model, based on the Strategy A method, returned similar results (consequently, the same discussion can be applied here) to the ones presented when only linear structures were considered. Once again, as highlighted in Section 2.1, the effect of continuous covariates in the log-likelihood function is smaller than the one applied by a categorical variable, when they are within the same range.
Please check Supplementary Material for the p-values of the selected (authentic and noise) variables. Please note that the covariate x 7 is omitted from these plots since a P-spline is being considered to model its relationship with both parameters μ (authentic) and σ (noise), thus the resulting coefficients of each smoother and its standard error (consequently its p-value as well) refer only to the linear part of the smoother and not to the smoother's contribution as a whole Ramires et al., 2019). As in the linear structure scenario, we may conclude that BIC criterion returned the smallest p-values and, as expected, when the sample size increases, the variables, when selected, become more significant.
Finally, regarding noise variables, in the worst case scenario, variable x 7 was roughly selected more than 25% of the times to compose the regression structure of σ (considering AIC with a sample size equals to 150).

Count Data
Considering Y ∼ ZI P (μ, σ ) and that x 7 has a nonlinear effect in μ, the following regression structures were used to generate the data μ = exp [0.4 + 0.4x 1 + 0.04x 3 + 0.04x 5 + 0.2 sin(0.2π x 7 )] and σ = logistic[−2.11 + 0.75x 2 + 1.85x 3 + 0.5x 6 ].  Figure 5 displays the percentages of selected authentic and noise variables in this simulation study. The observed behavior for both parameters μ (mean of the Poisson component) and σ (exactly probability of Y = 0) is quite similar to the one in the scenario where only a linear structure was considered in the ZIP model, presented in Figure 2. All p-values from this simulation study are presented in Supplementary Material. Please note, once again, that covariate x 7 is omitted due to the P-spline considered to model its relationship in both regression structures.

Censored Data
In this final simulation study, let us consider Y ∼ W ei(μ, σ ) and a nonlinear effect in μ. The following regression structures were used to generate the data μ = exp 3.98 + 0.09x 1 − 0.07x 3 + 0.04x 5 + log (1 + 10 sin(0.2π x 7 )/40) and The results are presented in Figure 6. As can be seen the observed percentages of variable selection in all scenarios are quite similar to the one presented in Figure 3, i.e. a high selection rate for authentic discrete variables, while the AIC criterion presented the greatest correct selection rate (Panel (d)). All p-values of this simulation study are presented in Supplementary Material.

Concluding Remarks
The Strategy A procedure, accessed through the stepGAICAll.A() function implemented in the gamlss package in R is a stepwise-based procedure to select variables based on the generalized Akaike information criterion (GAIC). Due to the GAIC definition, the selection of a given variable will directly depend on how much such variable increases the log-likelihood function (decreasing the GAIC value), weighted by the increasing on the number of degrees of freedom. Thus, continuous covariates tended to be selected fewer times than discrete ones if they have some similarity. Nonetheless, three levels of penalties κ were considered, showing that the AIC criterion tends to include more variables in the model (including those that should not be selected, i.e. noise variables), BIC tends to select only covariates with strong effects and GAICav was a moderate criterion, which is somewhat expected. Furthermore, as sample size increases, the results of the stepwise-based procedure become better and more realistic. Finally, as suggestions for future research, we can think in the behavior of the Strategy A procedure considering more complex distributions (three or more parameters), different shapes for the nonlinear term, the presence of interactions between covariates and others.

Supplementary Material
Please note that the following supplementary files are available online: i) suppl_stepgaic.pdf: p-values for the selected variables in each simulated scenario; and ii) codes_stepgaic.zip: all codes in R software that were used to conduct the simulation studies presented in this paper.