Sign-based Shrinkage Based on an Asymmetric LASSO Penalty

Penalized regression provides an automated approach to perform simultaneous variable selection and parameter estimation and is a popular method to analyze high-dimensional data. Since the conception of the LASSO in the mid-to-late 1990s, extensive research has been done to improve penalized regression. The LASSO, and several of its variations, performs penalization symmetrically around zero. Thus, variables with the same magnitude are shrunk the same regardless of the direction of effect. To the best of our knowledge, sign-based shrinkage, that is, preferential shrinkage based on the sign of the coefficients, has yet to be explored under the LASSO framework. We propose a generalization to the LASSO, the asymmetric LASSO, that performs sign-based shrinkage. Our method is motivated by placing an asymmetric Laplace prior on the regression coefficients, rather than a symmetric Laplace prior. This corresponds to an asymmetric $\ell_1$ penalty under the penalized regression framework. In doing so, preferential shrinkage can be performed through an auxiliary tuning parameter that controls the degree of asymmetry. Our numerical studies indicate that the asymmetric LASSO performs better than the LASSO when effect sizes are sign skewed. Furthermore, in the presence of positively-skewed effects, the asymmetric LASSO is comparable to the non-negative LASSO without the need to place an a priori constraint on the effect estimates and outperforms the non-negative LASSO when negative effects are also present in the model. A real data example using the breast cancer gene expression data from The Cancer Genome Atlas is also provided, where the asymmetric LASSO identifies two potentially novel gene expressions that are associated with BRCA1 with a minor improvement in prediction performance over the LASSO and non-negative LASSO.


Introduction
Recent developments in data acquisition, collection, and storage have allowed researchers to obtain a large number of potential predictors in order to avoid missing important factors that may be associated with the outcome of interest. This is often the case in genomic studies, where the number of predictors collected is often larger than the sample size. Simultaneous variable selection and parameter estimation is an essential task in high-dimensional data analysis that aims to identify a smaller subset of important variables. Penalized regression methods accomplish this by shrinking the regression coefficients toward zero while setting some coefficients equal to zero. These methods estimate a sparse vector of regression coefficients by minimizing an objective function that is composed of both a loss function and a penalty function. One of the most popular penalized regression methods, if not the most popular, is the LASSO (Tibshirani, 1996). Since its conception in the mid-to-late 1990s, the LASSO framework has been extensively used in several different research areas including, but not limited to, signal processing (Angelosante et al., 2009), genomic studies (Huang and Pan, 2003; Ghosh and Chinnaiyan, 2005; Wu and Lange, 2008; Wu et al., 2009, 2011), finance (Wu et al., 2014; Pereira et al., 2016; Panagiotidis et al., 2018), and text mining (Li et al., 2014; Debortoli et al., 2016). The LASSO performs estimation and selection by forcing the sum of the absolute values of the regression coefficients (the $\ell_1$ norm) to be less than a non-negative fixed value which, consequently, forces some of the coefficients to zero. From a Bayesian perspective, the LASSO is motivated by placing a Laplace prior on the regression coefficients (see e.g., Tibshirani, 1996; Park and Casella, 2008; Hans, 2009). The density of the Laplace distribution is provided in Figure 1 (solid black line).
The prior is symmetric around 0 implying that the degree of shrinkage for a particular magnitude is the same regardless of the direction of effect. Several extensions and improvements, both in estimation and computation, to the LASSO have been proposed in the literature (see e.g., Tibshirani, 1997;Fan and Li, 2001;Zou and Hastie, 2005;Tibshirani et al., 2005;Yuan and Lin, 2006;Meinshausen and Bühlmann, 2006;Friedman et al., 2008;Zou, 2006;Wu and Lange, 2008;Friedman et al., 2010;Zhang et al., 2010;Tibshirani et al., 2012).
While extensive research has been done to improve penalization by shrinking the magnitude of the coefficients differently (Zou, 2006), to the best of our knowledge, preferential shrinkage based on the sign of the coefficients has yet to be explored. Traditionally, LASSO and other penalized regression procedures shrink variables symmetrically around 0. That is, the degree of shrinkage for a particular magnitude is the same regardless of the direction of effect. There are several motivating scientific questions for which shrinking positive and negative coefficients equally may not be desirable. Typically, such studies leverage prior knowledge suggesting that effects are favored in one direction over another, or they are particularly interested in one effect direction. For example, BRCA1 is a DNA damage repair gene that has been shown to have strong associations with breast and ovarian cancer risk (Welcsh et al., 2000; Welcsh and King, 2001). Based on this knowledge, we may be interested in identifying additional genes that are associated with BRCA1 expression; in particular, genes with a positive association. These genes could implicate mechanisms through which BRCA1 impacts cancer risk and warrant further investigation in future studies. Since we are interested in identifying genes that have an elevated effect on BRCA1, we may want to focus our attention on selecting positive effects, while also allowing for the possibility of identifying strong negative effects. Additionally, in certain genomic studies designed for the construction of a polygenic risk score (PRS), there is an emphasis on identifying risk variants for certain diseases (Khera et al., 2018). Variable selection procedures, such as the LASSO, are often employed to identify relevant markers that are used in developing a PRS.
Finally, certain biomarkers within known biological pathways may be suspected to be associated with elevated risk (i.e., a positive association with the disease outcome). In a metabolomic investigation we may be particularly interested in discovering additional biomarkers with smaller risk effects which may help elucidate the biological mechanisms of the disease.
Generalizations to the LASSO, such as the constrained LASSO, have been developed to augment the standard LASSO with linear equality and inequality constraints (Efron et al., 2004;James et al., 2012;Tibshirani and Taylor, 2011;Wu et al., 2014;Gaines et al., 2018). The nonnegative LASSO is an example of the constrained LASSO that requires the LASSO coefficients to be nonnegative. At first glance, this formulation seems to solve the issue of preferential shrinkage since it forces effect estimates to be positive. However, these linear constraints must be specified a priori and can be problematic if negative effects are present. It would be ideal to develop a LASSO-based variable selection method that can perform preferential shrinkage without the need to place a priori constraints on the parameter space.
Motivated by this idea, we propose a new variation of LASSO penalization that accomplishes asymmetric shrinkage. Our proposed method, asymmetric LASSO, replaces the standard $\ell_1$ penalty with an asymmetric $\ell_1$ penalty. In doing so, the asymmetric LASSO performs preferential shrinkage through an auxiliary tuning parameter that controls the degree of asymmetry. While estimation is focused under a penalized regression framework, we provide a Bayesian interpretation that motivates the use of the asymmetric $\ell_1$ penalty. Specifically, one can view the asymmetric LASSO as placing an asymmetric Laplace prior on the regression coefficients. We also show that the standard LASSO is a special case of the asymmetric LASSO. Since the objective function is convex, we employ an efficient optimization algorithm for our implementation.
The paper is organized as follows. In Section 2 we introduce the asymmetric LASSO under a generalized linear model framework. We provide insight into the behavior of the estimator under the ordinary least squares model with orthogonal design. Simulation studies are provided in Section 3 to explore the empirical properties of asymLASSO and compare its performance to both the traditional LASSO and non-negative LASSO across several scenarios. We provide a real data example using the breast cancer gene expression data from The Cancer Genome Atlas (TCGA) in Section 4. Finally, parting comments and future directions are discussed in Section 5.

The Asymmetric LASSO
Let us consider the generalized linear model (GLM) with a response vector $y$ and design matrix $X = (x_1, \ldots, x_n)^T$. Assume that the observations $v_i = (x_i^T, y_i)^T$, $i = 1, \ldots, n$, are mutually independent, and that, conditional on $x_i$, $y_i$ belongs to the exponential family with the following density

$$f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}, \tag{1}$$

where $\theta$ is defined as the canonical parameter, $\phi > 0$ is the scale (dispersion) parameter and $a(\phi)$, $b(\theta)$, and $c(y, \phi)$ are known functions whose values depend on the distribution (McCullagh and Nelder, 1983; Dobson and Barnett, 2018). If we assume that $b(\cdot)$ is twice differentiable, then Model (1) indicates that $E(y_i \mid x_i) = \mu_i = b'(\theta_i)$ and $\mathrm{var}(y_i \mid x_i) = b''(\theta_i)a(\phi)$. Furthermore, the canonical parameter $\theta$ is connected to $x_i$ through a prespecified link function $h(\mu_i) = x_i^T\beta$ for some $\beta = (\beta_1, \ldots, \beta_p)^T$. Examples of commonly used GLMs with canonical link include linear regression, logistic regression, and Poisson regression. We can now define the likelihood function for $\beta$,

$$L(\beta; v_i) = \prod_{i=1}^{n} f(y_i; \theta_i, \phi). \tag{2}$$

Consequently, the log-likelihood is defined as $l(\beta) = \log L(\beta; v_i)$. The regression coefficients $\beta$ are typically estimated by minimizing the negated log-likelihood function. Typically, not all of the $p$ covariates that are included in the data are associated with the outcome, and interest lies in estimating a sparse $\beta$ (i.e., several values of $\beta$ are 0). This is especially the case in the high-dimensional ($p > n$) setting. Penalized regression provides an automated approach to perform simultaneous variable selection and parameter estimation.
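As a worked instance of the exponential-family density in Model (1), the Bernoulli distribution underlying logistic regression can be written in this form. The sketch below assumes the standard McCullagh and Nelder labeling, in which $b(\theta)$ is the cumulant function and $a(\phi)$ the dispersion function; that labeling is an assumption about the intended notation.

```latex
\begin{align*}
f(y;\pi) &= \pi^{y}(1-\pi)^{1-y}
          = \exp\Big\{ y\log\frac{\pi}{1-\pi} + \log(1-\pi) \Big\}, \qquad y \in \{0,1\},\\
\theta &= \log\frac{\pi}{1-\pi}, \qquad b(\theta) = \log(1+e^{\theta}), \qquad a(\phi) = 1, \qquad c(y,\phi) = 0,\\
\mathrm{E}(y) &= b'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} = \pi, \qquad
\mathrm{var}(y) = b''(\theta)\,a(\phi) = \pi(1-\pi).
\end{align*}
```

Here the canonical link is the logit, $h(\mu) = \log\{\mu/(1-\mu)\}$, so that $\theta_i = x_i^T\beta$.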
To conceptualize asymmetric penalization, we motivate the idea under a Bayesian framework where we propose to model the regression coefficients using an asymmetric Laplace prior

$$\pi(\beta_j \mid \lambda, \tau) = 2\lambda\tau(1-\tau)\exp\{-\lambda(|\beta_j| + (2\tau - 1)\beta_j)\}, \tag{3}$$

where $\lambda \ge 0$ is the scale parameter and $\tau \in (0, 1)$ is the skewness parameter that controls the asymmetry. Two examples of the asymmetric Laplace distribution are provided in Figure 1 (dotted grey line) for $\tau = 0.25$ (Figure 1a) and $\tau = 0.75$ (Figure 1b). In both figures, we see that the distribution is still concentrated at 0; however, the behavior of the tails is asymmetric. When $\tau = 0.25$, the left and right tails of the density are narrower and wider than those of the standard Laplace distribution, respectively. More mass is reserved for positive-valued $\beta$ than for negative-valued $\beta$. The converse is true when $\tau = 0.75$. We can allow the data to dictate the choice of $\tau$, allowing us to perform sign-dependent shrinkage in a data-driven manner rather than through a prespecified constraint as in the constrained LASSO.
In the context of penalized regression, the LASSO estimates are obtained by minimizing an objective function that is composed of the negated log-likelihood function plus an $\ell_1$ penalty function. It is easy to show that the $\ell_1$ penalty, $|\beta|$, is proportional to the negated log-density of the standard Laplace distribution. Like the LASSO, imposing an asymmetric Laplace prior on $\beta$ has a direct correspondence to estimation using a penalized likelihood. By rewriting the check function $f(x) = x(\tau - I(x < 0)) = (|x| + (2\tau - 1)x)/2$, the asymmetric Laplace distribution will correspond to an asymmetric $\ell_1$ penalty, $|\beta_j| + (2\tau - 1)\beta_j$, and our asymmetric LASSO (asymLASSO) estimator is defined as

$$\hat{\beta}(\tau) = \arg\min_{\beta} \left\{ -l(\beta) + \lambda \sum_{j=1}^{p} \big( |\beta_j| + (2\tau - 1)\beta_j \big) \right\}. \tag{4}$$

The use of the check function as the basis for the penalization term in (4) and placing an asymmetric Laplace prior on $\beta$ are intrinsically connected to quantile estimation and quantile regression (Koenker and Bassett, 1978; Yu and Moyeed, 2001; Yu and Zhang, 2005; Kozumi and Kobayashi, 2011; Takeuchi et al., 2006). Specifically, if $\beta$ follows an asymmetric Laplace distribution with location parameter 0, scale parameter $1/(2\lambda)$, and skew parameter $\tau$, as in (3), then $\Pr(\beta < 0) = \tau$ and $\Pr(\beta > 0) = 1 - \tau$, and therefore 0 can be interpreted as the $\tau$-th quantile of the distribution. In the following section we show how $\tau$ impacts estimation when compared to the standard LASSO.
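The link between the check function, the asymmetric penalty, and the quantile interpretation can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative, and the prior density is assumed to be the asymmetric Laplace form with location 0, scale 1/(2λ), and skewness τ described in the text.

```python
import math

def check_loss(x, tau):
    """Quantile-regression check function x * (tau - I(x < 0))."""
    return x * (tau - (1.0 if x < 0 else 0.0))

def asym_l1_penalty(beta, tau):
    """Asymmetric l1 penalty: sum over j of |b_j| + (2*tau - 1) * b_j."""
    return sum(abs(b) + (2.0 * tau - 1.0) * b for b in beta)

def asym_laplace_pdf(b, lam, tau):
    """Assumed prior density: location 0, scale 1/(2*lam), skewness tau."""
    return 2.0 * lam * tau * (1.0 - tau) * math.exp(-2.0 * lam * check_loss(b, tau))

def mass_below_zero(lam, tau, lower=-60.0, steps=100_000):
    """Midpoint-rule approximation of Pr(beta < 0) under the assumed prior."""
    h = -lower / steps
    return h * sum(asym_laplace_pdf(lower + (k + 0.5) * h, lam, tau)
                   for k in range(steps))

if __name__ == "__main__":
    beta = [-0.4, 0.0, 0.7]
    for tau in (0.25, 0.5, 0.75):
        twice_check = 2.0 * sum(check_loss(b, tau) for b in beta)
        # The penalty equals twice the summed check loss, and the prior
        # mass below zero is approximately tau.
        print(tau, asym_l1_penalty(beta, tau), twice_check,
              round(mass_below_zero(1.0, tau), 3))
```

With τ = 0.5 the penalty reduces to the ordinary sum of absolute values, which is the LASSO special case.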

The Behavior of asymLASSO Under Orthogonal Design
To better understand the behavior of asymLASSO, we investigate the OLS model under an orthogonal design matrix (i.e., $X^TX = I_p$ with $p < n$). Under these conditions, asymLASSO leads to the following closed-form solution

$$\hat{\beta}_j(\mathrm{ols}; \tau) = S\big(\hat{\beta}_j^{\mathrm{ols}} - \lambda(2\tau - 1), \lambda\big), \tag{5}$$

where $S(a, b) = \mathrm{sgn}(a)(|a| - b)_+$ is the soft-thresholding operator (Donoho and Johnstone, 1994) defined for $\lambda \ge 0$ and $\hat{\beta}_j^{\mathrm{ols}} = x_j^Ty$ is the OLS estimate. Equation (5) follows a modified version of the LASSO and, in fact, is equivalent to the LASSO when $\tau = 1/2$. Thus the LASSO can be viewed as a special case of asymLASSO. Furthermore, as $\lambda \to 0$ we have that $\hat{\beta}_j(\mathrm{ols}; \tau) \to \hat{\beta}_j^{\mathrm{ols}}$, which implies that if $\lambda = o(1)$, then $\hat{\beta}(\mathrm{ols}; \tau)$ is a consistent estimator for all $\tau \in (0, 1)$. Figure 2 illustrates the behavior of the asymmetric LASSO soft-thresholding operator provided in Equation (5) under orthogonal design with $\tau = 0.25$, $0.5$, and $0.75$. We can compare the three panels in terms of their effect on both bias and sparsity. Note that $\hat{\beta}_j(\mathrm{ols}; \tau) = 0$ whenever $-2\lambda(1 - \tau) \le \hat{\beta}_j^{\mathrm{ols}} \le 2\lambda\tau$. As discussed earlier, asymLASSO with $\tau = 0.5$ (Figure 2a) reduces to the LASSO. In this panel we see that $\hat{\beta}_j(\mathrm{ols}; 0.5) = 0$ whenever $-\lambda \le \hat{\beta}_j^{\mathrm{ols}} \le \lambda$. Furthermore, the nonzero values are penalized by a constant factor, $\lambda$, as indicated by the difference between the dotted gray line (true value of $\beta$) and solid black line (asymLASSO shrinkage). Figure 2b illustrates asymLASSO with $\tau = 0.25$ and we see that the thresholding function is shifted to the right. When compared to the LASSO (Figure 2a), positive-valued estimates will be less biased and less likely to be shrunk to 0 compared to negative-valued estimates of the same magnitude. Therefore, scenarios where we expect more positive-valued (and smaller) effect estimates will benefit from asymLASSO over the standard LASSO. We see the opposite relationship in Figure 2c where we set $\tau = 0.75$. In this situation, asymLASSO favors negative-valued effect estimates over positive-valued effect estimates.
Under the orthogonal design, we can explicitly quantify the shrinkage seen in Figure 2 for general $\tau$. To understand these properties better, we can think about the solution path as two components: Case 1: $\hat{\beta}_j^{\mathrm{ols}} > 0$. For this case we are only concerned with the positive estimates of asymLASSO. Here $\hat{\beta}_j(\mathrm{ols}; \tau) = 0$ whenever $\hat{\beta}_j^{\mathrm{ols}} \in [0, 2\lambda\tau]$. Hence when $\tau < 1/2$, $\hat{\beta}_j(\mathrm{ols}; \tau)$ is shrunk to 0 over a smaller interval than the LASSO. In fact, estimates where $\hat{\beta}_j^{\mathrm{ols}} \in (2\lambda\tau, \lambda]$ will be 0 for the LASSO and nonzero for $\hat{\beta}_j(\mathrm{ols}; \tau)$. Therefore, asymLASSO with $\tau < 1/2$ will select smaller positive effect estimates than the LASSO. When $\hat{\beta}_j^{\mathrm{ols}} > 2\lambda\tau$, $\hat{\beta}_j(\mathrm{ols}; \tau) = \hat{\beta}_j^{\mathrm{ols}} - 2\lambda\tau$. Again when $\tau < 1/2$, $\hat{\beta}_j^{\mathrm{ols}} - \lambda < \hat{\beta}_j^{\mathrm{ols}} - 2\lambda\tau < \hat{\beta}_j^{\mathrm{ols}}$, and asymLASSO provides a less biased estimate when compared to the LASSO. However, when $\tau > 1/2$, we have $\lambda < 2\lambda\tau$ and therefore asymLASSO tends to overshrink and produce more biased estimates compared to the LASSO.

as in the ordinary least squares model and let
Defining $\hat{\beta}$ as the solution to (4), if A holds, then under mild regularity conditions.
The proof is provided in the Online Supplementary Material and mirrors the proof for the ordinary LASSO estimator. Note that $(1 + |2\tau - 1|) \in [1, 2)$ and equals one when $\tau = 1/2$. Therefore these bounds can be larger (up to a constant) than the bounds for the ordinary LASSO estimator. Consequently, if $X^TX = I_p$, then we also have β −β 2

Implementation via Cyclic Coordinate Descent
For notational convenience, we suppress the dependence of $\hat{\beta}$ on $\tau$. Letting $\nabla l(\beta) = \partial l(\beta)/\partial\beta = X^Tu$ and $-\nabla^2 l(\beta) = -\partial^2 l(\beta)/\partial\beta\,\partial\beta^T = X^TWX$, we approximate the negated log-likelihood, up to an additive constant, based on a Taylor series expansion about the current iterate $\beta^{(m)}$:

$$-l(\beta) \approx \frac{1}{2}(\tilde{y} - X\beta)^T W (\tilde{y} - X\beta),$$

where $\tilde{y}$ is the working response vector $\tilde{y} = X\beta^{(m)} + W^{-1}u$. Note here that $u$, $W$, and $\tilde{y}$ are dependent on $\beta^{(m)}$. With this approximation, efficient convex optimization algorithms can be used to minimize (4). We employ cyclic coordinate descent, a widely-used algorithm for penalization (Wu and Lange, 2008; Friedman et al., 2010; Breheny and Huang, 2011), for our implementation. The algorithm starts by setting all $p$ variables to some initial value (e.g., $\beta^{(0)} = 0$). It then solves a one-dimensional optimization problem by setting the first variable ($j = 1$) to a value that minimizes the objective function while holding all other variables constant. This process is repeated for the second variable, third variable, and so on. When the algorithm has cycled through all the variables, it returns to the first variable and repeats the cycling process until some convergence criterion is met. For asymLASSO, the one-dimensional update for the $j$th covariate at the $(m + 1)$th iteration is

$$\beta_j^{(m+1)} = \frac{S\big(r_j - \lambda(2\tau - 1), \lambda\big)}{v_j},$$

where $v_j$ is the $j$th diagonal element of $V = X^TWX$ and $r_j$ is the $j$th element of $r = X^TWu + V\beta^{(m)}$. Typically, we are interested in obtaining estimates for $\hat{\beta}$ over a range of values between a maximum value $\lambda_{\max}$ for which all coefficient estimates are 0 and a minimum value $\lambda_{\min}$ at which the model becomes excessively large (saturated) or ceases to be identifiable. For the LASSO, $\lambda_{\max} = \max_j\{|r_j|\}$ when the quadratic approximation is taken with respect to the intercept-only model (Friedman et al., 2010). This is due to the fact that the LASSO estimates are zero whenever $|r_j| \le \lambda$ for all $j$. The asymLASSO estimates, however, are zero whenever $|r_j - \lambda(2\tau - 1)| \le \lambda$. This complicates finding a value for $\lambda_{\max}$ since shrinkage is not symmetric about 0.
We propose to use a conservative value for λ max given by λ max = max j This bound is equivalent to the LASSO bound when τ = 1/2 and larger otherwise.
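The cyclic update can be sketched for the simplest case, a Gaussian linear model with W = I and no intercept. The code below is an illustrative sketch rather than the authors' implementation: `r_j` is computed as the inner product of column j with the partial residual (the usual textbook form), all columns are assumed to be nonzero, and `conservative_lambda_max` is one bound that follows from the zero condition |r_j − λ(2τ − 1)| ≤ λ; it is not necessarily the expression used in the paper.

```python
def soft_threshold(a, b):
    """S(a, b) = sgn(a) * max(|a| - b, 0)."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0

def asym_lasso_cd(X, y, lam, tau, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for
    0.5 * ||y - X b||^2 + lam * sum_j (|b_j| + (2*tau - 1) * b_j)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta (beta starts at zero)
    v = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Inner product of column j with the partial residual
            # that excludes covariate j.
            r_j = sum(X[i][j] * resid[i] for i in range(n)) + v[j] * beta[j]
            new = soft_threshold(r_j - lam * (2.0 * tau - 1.0), lam) / v[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def conservative_lambda_max(X, y, tau):
    """A value of lam at which every coefficient is zero: all estimates
    vanish when -2*lam*(1 - tau) <= x_j^T y <= 2*lam*tau for every j."""
    n, p = len(X), len(X[0])
    r = [abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p)]
    return max(r) / (2.0 * min(tau, 1.0 - tau))

if __name__ == "__main__":
    X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # orthogonal design
    y = [2.0, -0.75, 0.75]
    print(asym_lasso_cd(X, y, lam=1.0, tau=0.25))
    print(conservative_lambda_max(X, y, tau=0.25))
```

The bound in `conservative_lambda_max` coincides with max_j |r_j| when τ = 1/2 and grows as τ moves toward 0 or 1, in line with the description of the proposed bound.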

Selection of τ and λ
Model complexity depends critically on the choice of the tuning parameters. As evident in Section 2.2, τ induces a "sign-specific shrinkage tradeoff" that determines whether emphasis is placed on shrinking positive or negative-valued effects. While one can consider specific biological scenarios in which τ can be selected a priori, as shown in Section 3.2, a naively prespecified value can lead to biased estimation and improper shrinkage. The penalization parameter λ dictates the degree of shrinkage and therefore must be carefully selected. In practice, one generally implements a penalization method across a grid of tuning parameters and selects the tuning parameter that minimizes some criterion. Since estimating both τ and λ is of interest, we use a two-dimensional grid search to select the optimal pair (τ opt , λ opt ). Several criteria have been proposed in the literature including, but not limited to, k-fold cross validation, generalized cross validation (Golub et al., 1979), the Akaike information criterion (Akaike, 1974) and the Bayesian information criterion (Schwarz et al., 1978).
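The two-dimensional grid search can be sketched as follows. This is a minimal illustration under stated assumptions: a Gaussian linear model without an intercept, a compact coordinate-descent fitter named `fit`, k-fold cross validation with mean squared prediction error as the criterion, and small hypothetical grids. None of the function names or grid values come from the paper.

```python
import random

def soft_threshold(a, b):
    """S(a, b) = sgn(a) * max(|a| - b, 0)."""
    return a - b if a > b else a + b if a < -b else 0.0

def fit(X, y, lam, tau, sweeps=50):
    """Coordinate descent for 0.5*||y - Xb||^2 + lam*sum(|b_j| + (2*tau-1)*b_j)."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, list(y)
    v = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            r_j = sum(X[i][j] * resid[i] for i in range(n)) + v[j] * beta[j]
            new = soft_threshold(r_j - lam * (2.0 * tau - 1.0), lam) / v[j]
            for i in range(n):
                resid[i] -= X[i][j] * (new - beta[j])
            beta[j] = new
    return beta

def cv_error(X, y, lam, tau, folds):
    """Mean squared prediction error accumulated over held-out folds."""
    sse, count = 0.0, 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        beta = fit([X[i] for i in train], [y[i] for i in train], lam, tau)
        for i in held_out:
            pred = sum(x * b for x, b in zip(X[i], beta))
            sse += (y[i] - pred) ** 2
            count += 1
    return sse / count

def grid_search(X, y, taus, lams, k=5, seed=0):
    """Return the (tau, lam) pair minimizing the k-fold CV error, plus all scores."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    scores = {(t, l): cv_error(X, y, l, t, folds) for t in taus for l in lams}
    return min(scores, key=scores.get), scores

def demo_data(n=60, seed=3):
    """Small synthetic data set with two positive effects (illustrative only)."""
    rng = random.Random(seed)
    beta = [0.5, 0.0, 0.0, 0.3]
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, 0.5) for row in X]
    return X, y

if __name__ == "__main__":
    X, y = demo_data()
    best, _ = grid_search(X, y, taus=[0.25, 0.5, 0.75], lams=[0.5, 2.0, 8.0])
    print("selected (tau, lambda):", best)
```

Using the same fold assignment for every (τ, λ) pair keeps the comparison across grid points on equal footing; other criteria such as AIC or BIC could replace `cv_error` without changing the search loop.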

Numerical Studies
A series of simulations are conducted to illustrate the performance of asymLASSO under various design settings. All computations are carried out using the R programming language. The design matrix $X = (x_1, \ldots, x_n)^T$ is generated from a multivariate Gaussian distribution with mean 0 and variance-covariance matrix $\Sigma$. We allowed for mild correlation between covariates by specifying an autoregressive covariance structure, $\Sigma_{ij} = 0.5^{|i-j|}$. The data are generated from a normal linear model

$$y_i = \mu + x_i^T\beta^* + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_y^2),$$

where $\mu$ is the intercept term. Clarification of the simulation parameters, such as the structure of $\beta^*$, is provided in the corresponding subsections.
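The data-generating scheme can be sketched in a few lines. This is an illustration in plain Python rather than the R code used for the study; it assumes the outcome follows y = μ + xᵀβ* + ε with Gaussian noise of standard deviation σ_y, and it draws the covariates through a stationary AR(1) recursion, which yields unit variances and correlation ρ^|i−j| between covariates i and j. The function name and default values are illustrative.

```python
import math
import random

def simulate_linear(n, beta, mu=0.10, sigma_y=0.5, rho=0.5, seed=1):
    """Simulate (X, y) with AR(1)-correlated Gaussian covariates,
    Cov(x_i, x_j) = rho ** |i - j|, and a normal linear outcome."""
    rng = random.Random(seed)
    p = len(beta)
    scale = math.sqrt(1.0 - rho ** 2)  # keeps every covariate at unit variance
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for j in range(1, p):
            row.append(rho * row[j - 1] + scale * rng.gauss(0.0, 1.0))
        X.append(row)
        signal = sum(x * b for x, b in zip(row, beta))
        y.append(mu + signal + rng.gauss(0.0, sigma_y))
    return X, y

if __name__ == "__main__":
    beta_star = [-0.03, 0, 0, -0.03, -0.03, 0.03, 0.03, 0, 0, 0.03]
    X, y = simulate_linear(400, beta_star)
    print(len(X), len(X[0]), round(sum(y) / len(y), 3))
```

Setting `rho=0.0` gives independent covariates, corresponding to an identity covariance matrix.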

Sensitivity to τ
As noted in Section 2.4, preferential shrinkage of positive or negative effects is dictated by $\tau$. We investigate the effect $\tau$ has on the selection performance of asymLASSO. We used an evenly-spaced grid on the interval $[0.05, 0.95]$ for $\tau$. For each value of $\tau$, we used five-fold cross validation over a data-driven grid of 20 values to estimate $\lambda$. We set $n = 400$, $\mu = 0.10$, and $\beta^* = (-0.03, 0, 0, -0.03, -0.03, 0.03, 0.03, 0, 0, 0.03)$ and vary $\sigma_y \in \{0.3, 0.5\}$. Furthermore, we let $\Sigma = I_{10}$ so that the covariates are independent. We compared the following approaches: 1) asymLASSO with fixed $\tau \in \{0.05, 0.25, 0.5, 0.75, 0.95\}$ and 2) asymLASSO with $\tau$ also being estimated via cross validation, and evaluated their selection performance through the inclusion probability ($P_j$), the proportion of simulations that correctly identify $\beta^*_j$ as non-zero. We report our findings in Table 1, where the results are averaged over $B = 100$ Monte Carlo replicates.
We can see that for $\tau < 0.5$, asymLASSO has a higher probability of selecting the positive-signed effects ($P_6$, $P_7$, and $P_{10}$) over the negative-signed effects ($P_1$, $P_4$, and $P_5$), the degree of which is determined by the value of $\tau$; the opposite is true for $\tau > 0.5$. By construction of the parameter vector, we do not expect to prefer shrinking positive effects over negative effects or vice versa. In fact, when $\tau = 0.5$ (i.e., the standard LASSO) we see that the inclusion probabilities for all six non-zero variables are comparable. Furthermore, the estimated optimal value for $\tau$, averaged over all 100 simulations, is close to 0.5, suggesting that a data-driven method should be used to select $\tau$ rather than a prespecified value. We also assessed selection performance under correlated covariates (Tables S1 and S2 in the Online Supplementary Material). As expected, selection performance worsens when correlation is present; however, the conclusions generally remain consistent with what we observe in Table 1.

Finite Sample Performance Compared to the LASSO
In this section we study the finite sample performance of asymLASSO compared to both LASSO and the non-negative LASSO (nLASSO). We let $\beta^* = (\beta_0, 0_{p-10})$, where we set $\mu = 0.10$ and

Table 1: Asymmetric LASSO (asymLASSO) with varying values for $\tau$ where $n = 400$, $\Sigma = I_{10}$, $\mu = 0.10$, and $\beta^* = (-0.03, 0, 0, -0.03, -0.03, 0.03, 0.03, 0, 0, 0.03)$. The tuning parameter $\lambda$ was selected using five-fold cross validation over an evenly-spaced grid $[0.05, 0.95]$. Results are averaged over 100 simulations. $\hat{\tau}_{CV}$ is the average value of $\tau$ selected via cross validation for each of the 100 simulations ($P_j$ = proportion of simulations where $\beta_j$ is correctly identified as non-zero). See Section 3.1 for more details.

For asymLASSO, we used an evenly-spaced grid on the interval $[0.05, 0.95]$ to select $\tau$. Oracle estimates were retrieved from OLS regression using the underlying true model. Both LASSO and the non-negative LASSO were performed using the glmnet package (Friedman et al., 2010). A data-driven grid of 20 $\lambda$ values was employed for all three methods and five-fold cross validation was used to select the final model. We evaluate the approaches by their variable selection, parameter estimation, and prediction performance. For variable selection, we used the probability of inclusion measures defined in Section 3.1 as well as the mean number of false positives (FP) and mean number of false negatives (FN). Estimation bias is estimated using the mean squared bias, where $B$ is the number of simulations. Lastly, prediction performance is estimated using the predicted mean squared error (PMSE) derived from a test set of $n = 1{,}000$. Results are averaged over $B = 100$ Monte Carlo replicates and are presented in Table 2 for Model 1 when $n = 400$ and $800$, $p = 50$ and $200$, and $\Sigma = (0.5^{|i-j|})_{ij}$.
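The selection and prediction measures described above can be computed as in the following sketch. The function names are illustrative, and the mean squared bias is omitted because its exact definition is not recoverable from the text.

```python
def selection_counts(beta_hat, beta_true):
    """Number of false positives and false negatives for one fitted model."""
    fp = sum(1 for b, t in zip(beta_hat, beta_true) if b != 0 and t == 0)
    fn = sum(1 for b, t in zip(beta_hat, beta_true) if b == 0 and t != 0)
    return fp, fn

def inclusion_probability(estimates, j):
    """Proportion of simulation replicates in which coefficient j is non-zero."""
    return sum(1 for b in estimates if b[j] != 0) / len(estimates)

def pmse(X_test, y_test, mu_hat, beta_hat):
    """Predicted mean squared error on an independent test set."""
    errors = [(y - mu_hat - sum(x * b for x, b in zip(row, beta_hat))) ** 2
              for row, y in zip(X_test, y_test)]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    truth = [0.03, 0.03, 0.0, 0.0]
    fits = [[0.1, 0.0, 0.2, 0.0], [0.05, 0.02, 0.0, 0.0]]  # two hypothetical replicates
    print([selection_counts(f, truth) for f in fits])
    print(inclusion_probability(fits, 1))
```

Averaging the per-replicate false positive and false negative counts over the B replicates gives the FP and FN columns reported in the tables.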
First, we observe that estimation and prediction performance between the three methods are comparable. However, both nLASSO and asymLASSO have better selection performance than the traditional LASSO across all five true non-zero coefficients, especially for the smaller effects.

Table 2: Comparison of asymLASSO to LASSO and the non-negative LASSO (nLASSO) based on 100 Monte Carlo replicates. (MSB = mean squared bias; FP = mean number of false positives (out of 45); FN = mean number of false negatives (out of 5); $P_j$ = proportion of simulations where $\beta_j$ is correctly identified as non-zero; PMSE = averaged predicted mean squared error.) See Section 3.2 for more details.

For example, when $n = 400$, the probability of inclusion for $\beta^*_{04} = 0.03$ for the LASSO is 43% compared to 47% for nLASSO and asymLASSO. Our estimated value for $\tau$ using cross validation is $\hat{\tau} = 0.20 < 0.50$, which is expected since our true model is comprised of only positive signals. Furthermore, nLASSO tends to identify fewer false positives compared to LASSO and asymLASSO. As the sample size increases ($n = 800$), all three methods have improved overall performance but the patterns between them remain the same. In Table S3 of the Online Supplementary Material we repeat the same scenario but under two different correlation structures ($\Sigma = I$ and $\Sigma_{ij} = 0.80^{1(i \neq j)}$). When the covariates are independent, the results mirror what we observe in Table 1. Surprisingly, under high equicorrelation, all three methods perform similarly in terms of selection while asymLASSO identifies slightly more false positives.
In our previous example, the performance of asymLASSO falls somewhere between the LASSO and nLASSO. asymLASSO had better selection performance than the LASSO but at the expense of identifying more false positives than nLASSO. Moreover, we expected nLASSO to perform well under the previous setting since the effect sizes were all positive. A predetermined constraint was placed to ensure that only positive effects were retained in the model for nLASSO, whereas asymLASSO allowed the data to dictate the shrinkage, which preferred selecting positive effects over negative ones. In models where negative effects are present, we would expect nLASSO to perform poorly. We further illustrate this in Table 3 where we allow the coefficient estimates to vary in sign under Models 2, 3, and 4. In general, nLASSO produces sparser models than both LASSO and asymLASSO. Under Model 2, where the smallest and largest effects are negative, nLASSO fails to select the largest effect. Surprisingly, nLASSO selected the smallest negative effect but erroneously estimated its effect as positive. We see this same pattern in Models 3 and 4 where the two largest and two smallest effect sizes are negative, respectively. Focusing our attention on LASSO and asymLASSO, both methods have comparable performance and it is difficult to prefer one over the other. For example, in Model 3, asymLASSO has better selection performance for the positive-signed coefficients compared to the LASSO but worse selection performance for the two negative-signed coefficients due, in part, to their magnitude. We also see that asymLASSO is aiming to balance the sign-specific shrinkage tradeoff based on the sign and magnitude of the effects that are present in the data, as reflected by the estimated value for $\tau$ for each model, providing a data-driven approach to asymmetric penalization that does not require an a priori constraint as in the constrained LASSO.
Lastly, we compare all three methods in a high-dimensional setting with $n = 400$, $p = 2{,}000$, $\Sigma = (0.5^{|i-j|})_{ij}$, and under Models 1 and 2. We increase the signal size of the smallest effect to 0.08 (Table 4). We notice similar results to what we have observed previously (Tables 2 and 3). While asymLASSO identifies slightly more false positives, the mean false positive rates (the number of false positives over 1,995) are comparable across all three methods.

Table 4: Comparison of asymLASSO to LASSO and nLASSO in a high-dimensional OLS setting. Results based on 100 Monte Carlo replicates with $n = 400$ and $p = 2{,}000$. Model 1: $\beta_0 = (0.08, 0, 0, 0.08, 0, 0.10, 0.10, 0, 0, 0.15)$; Model 2: $\beta_0 = (-0.08, 0, 0, 0.08, 0, 0.10, 0.10, 0, 0, -0.15)$. (MSB = mean squared bias; FP = mean number of false positives (out of 1,995); FN = mean number of false negatives (out of 5); $P_j$ = proportion of simulations where $\beta_j$ is correctly identified as non-zero; PMSE = averaged predicted mean squared error.) See Section 3.2 for more details.

Columns: Model, Method, MSB, FP, FN, $P_1$, $P_4$, $P_6$, $P_7$, $P_{10}$, PMSE.

Assessing Sign-Dependent Shrinkage
Our simulations from Section 3.2 show that asymLASSO outperforms LASSO in terms of variable selection when the true effects are in the same direction. In the presence of mixed-sign and mixed-magnitude effects, asymLASSO and LASSO have their own respective benefits and drawbacks. When the covariates in the model are independent (i.e., $\Sigma$ is a diagonal matrix), simple transformations on the columns of the design matrix will change the magnitude and/or direction of the corresponding coefficient estimate. For example, negating the entries in a column will switch the sign of the estimate from positive to negative (or vice versa).
In the following study, we compare the performance of LASSO and nLASSO to asymLASSO under simple sign transformations of the design matrix. Our simulation setup is similar to that of Section 3.2 except that $\Sigma = I_p$ to ensure independence among the covariates and we set $\sigma_y = 0.5$. Furthermore, we set the two smallest signals to be negative, i.e., $\beta_0 = (-0.03, 0, 0, -0.03, 0, 0.05, 0.05, 0, 0, 0.08)$ as in Model 2. Our interest lies in comparing the probability of inclusion for the two smallest signals $\beta^*_1 = -0.01$ and $\beta^*_4 = -0.01$ before and after we switch the signs of the first and fourth columns of $X$. In other words, we generate a design matrix $X$, simulate the outcome $y \mid X$, and create a new design matrix $\tilde{X}$ such that for all $i = 1, \ldots, n$:

$$\tilde{X}_{ij} = \begin{cases} -X_{ij}, & j \in \{1, 4\}, \\ X_{ij}, & \text{otherwise.} \end{cases}$$

Thus, $\tilde{X}$ only differs from $X$ in the first and fourth columns, where the entries of $\tilde{X}$ are the negated entries of $X$. In doing so, regressing the outcome $y$ on $\tilde{X}$ will produce positive effect estimates for the first and fourth coefficients, which will thus be in the same effect direction as the other non-zero values. Unlike LASSO, asymLASSO is sign-variant and we believe that selection performance will improve when using $\tilde{X}$ as the design matrix in the model over $X$. We evaluate the selection performance of LASSO, nLASSO and asymLASSO when using either $X$ or $\tilde{X}$ and present the results in Figure 3 when $n = 400$ and $p = 50$. While the data generation scheme is slightly different, the results for all three methods using $X$ as the design matrix reflect what we observed for Model 2 in Table 6. Specifically, the inclusion probabilities, $P_1$ and $P_4$, are lower for asymLASSO than LASSO since asymLASSO prefers selection of positive effects ($\hat{\tau} = 0.23$). Similarly, nLASSO incorrectly assigns positive effect estimates to both $\beta_1$ and $\beta_4$. We see a drastic improvement in selection performance when we use $\tilde{X}$ as the design matrix for asymLASSO since the effects of interest are coded to be in the same direction as the other (larger) non-zero effects ($\hat{\tau} = 0.22$).
The same is true for the nLASSO. Additionally, because LASSO shrinks symmetrically around zero, its performance is unchanged. We perform additional simulations (Figures S1 and S2 in the Online Supplementary Material) where we introduce correlation between the covariates. The overall conclusions are consistent with what we observe in Figure 3. Furthermore, as previously mentioned, the selection performance for asymLASSO is similar to the LASSO when the covariates are highly correlated ($\Sigma_{ij} = 0.8^{1(i \neq j)}$). These results show that by cleverly transforming the design matrix such that the expected effects are mostly (or all) in one direction, asymLASSO demonstrates better selection performance of small effects compared to the LASSO. This is particularly applicable in genomics studies where covariates may be coded a priori in the risk direction, so that increases in the covariate value correspond to higher risk for the outcome, which potentially allows for the discovery of small effects that may have been erroneously shrunken to zero by the LASSO.

Table: Model 1: $\beta_0 = (0.08, 0, 0, 0.10, 0, 0.12, 0.15, 0, 0, 0.25)$; Model 2: $\beta_0 = (-0.08, 0, 0, 0.10, 0, 0.12, 0.15, 0, 0, -0.25)$. (MSB = mean square bias; FP = mean number of false positives; FN = mean number of false negatives; $P_j$ = proportion of simulations where $\beta_j$ is correctly identified as non-zero; AUC = area under the curve estimate from the test set.) See Section 3.4 for more details. Columns: Model, Method, MSB, FP, FN, $P_1$, $P_4$, $P_6$, $P_7$, $P_{10}$, AUC.
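The column sign flip used in this study can be sketched in a few lines. The following Python snippet is only an illustration under stated assumptions: the helper name `flip_columns`, the random seed, and the use of ordinary least squares (rather than any penalized fit) are ours, and the coefficient vector is the β_0 listed for this study. It shows that negating columns 1 and 4 of X negates the corresponding unpenalized estimates and leaves the others unchanged, which is why a sign-invariant method such as LASSO is unaffected while a sign-variant penalty is not.

```python
import numpy as np


def flip_columns(X, cols):
    """Return a copy of X with the given (0-based) columns negated."""
    X_tilde = X.copy()
    X_tilde[:, cols] *= -1.0
    return X_tilde


# Hypothetical replicate of the setup: n = 400, p = 50, Sigma = I_p, sigma_y = 0.5.
rng = np.random.default_rng(0)
n, p = 400, 50
beta0 = np.array([-0.03, 0, 0, -0.03, 0, 0.05, 0.05, 0, 0, 0.08])
beta_star = np.concatenate([beta0, np.zeros(p - 10)])

X = rng.standard_normal((n, p))
y = X @ beta_star + 0.5 * rng.standard_normal(n)

# Columns 1 and 4 in the text are indices 0 and 3 here.
X_tilde = flip_columns(X, [0, 3])

# Unpenalized least squares fits on X and on the sign-flipped design.
b = np.linalg.lstsq(X, y, rcond=None)[0]
b_tilde = np.linalg.lstsq(X_tilde, y, rcond=None)[0]

print("OLS on X,       coefficients 1 and 4:", b[[0, 3]])
print("OLS on X-tilde, coefficients 1 and 4:", b_tilde[[0, 3]])
```

The outcome y is generated once and reused, matching the description above in which only the design matrix is transformed.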

Binary Outcome
To highlight the application within the GLM framework, we compared LASSO and nLASSO to asymLASSO under a binary outcome. Similar to Section 3.2, we set $\beta^* = (\beta_0, 0_{p-10})$ and generated $X$ from a multivariate Gaussian distribution with an autoregressive covariance structure. We simulated the outcome from the logistic regression model $y \mid x \sim \text{Bernoulli}\{\pi(\mu + x^T \beta^*)\}$, where $\pi(\cdot) = \exp(\cdot)/\{1 + \exp(\cdot)\}$. The intercept term μ = 0.50 corresponded to a case rate of approximately 60%. We evaluated prediction performance using the area under the curve (AUC) in a test set of n = 1,000. The results comparing asymLASSO to LASSO for the logistic regression model are displayed in Table 5. They are consistent with our earlier results (Tables 1 and 2), and we also observed consistent patterns, not reported, under different sample sizes, effect sizes, parameter dimensions, and correlation structures.
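The data-generating model for the binary outcome can be sketched as follows. This is a minimal illustration, not the simulation code used for the reported results: the autoregressive correlation `rho = 0.5`, the dimension p = 50, the seed, and the function names are assumptions, and β_0 is taken from Model 1. The AUC is computed with the rank-based (Mann-Whitney) formula, here applied to the true probabilities only to show the mechanics.

```python
import numpy as np


def ar1_cov(p, rho):
    """AR(1) covariance matrix with entries rho^|i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate_binary(n, beta_star, mu, rho, rng):
    """Draw (X, y, prob) with y | x ~ Bernoulli(expit(mu + x' beta_star))."""
    p = beta_star.size
    chol = np.linalg.cholesky(ar1_cov(p, rho))
    X = rng.standard_normal((n, p)) @ chol.T
    prob = 1.0 / (1.0 + np.exp(-(mu + X @ beta_star)))
    y = rng.binomial(1, prob)
    return X, y, prob


def auc(y, score):
    """Rank-based AUC; assumes no ties in score and both classes present."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = int(np.sum(y == 1))
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2.0) / (n1 * n0)


# Model 1 coefficients padded with zeros; p = 50 and rho = 0.5 are assumed values.
beta0 = np.array([0.08, 0, 0, 0.10, 0, 0.12, 0.15, 0, 0, 0.25])
beta_star = np.concatenate([beta0, np.zeros(40)])

rng = np.random.default_rng(1)
X_test, y_test, prob_test = simulate_binary(1000, beta_star, mu=0.5, rho=0.5, rng=rng)

print("empirical case rate:", y_test.mean())
print("AUC of the true probabilities:", auc(y_test, prob_test))
```

In practice the score passed to `auc` would be the fitted linear predictor from each penalized model evaluated on the test set.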

Real Data Analysis: Breast Cancer Gene Expression
BRCA1 is a DNA damage repair gene that produces tumor suppressor proteins. Pathogenic variants in BRCA1 and BRCA1 expression have been shown to have strong associations with breast and ovarian cancer risk (Welcsh et al., 2000; Welcsh and King, 2001). BRCA1 is known to interact with many other genes, particularly in response to DNA damage. In this analysis, we aimed to identify genes associated with BRCA1 expression, as such genes could implicate ... Gene expression data were available for 17,814 genes measured in breast cancer tissue samples from 536 women with breast cancer from The Cancer Genome Atlas (TCGA). The data are available at http://cancergenome.nih.gov and have been previously analyzed in Breheny (2019). We excluded 491 genes due to expression values missing in one or more women. Expression values of the remaining 17,322 genes were log-transformed and standardized. A broad grid over [0.05, 0.95] was used to estimate τ for asymLASSO. Similar to our simulation study, LASSO, nLASSO, and asymLASSO were fit using 10-fold cross validation. We randomly split the data into a training set (n = 357) and a test set (n = 179). Table 6 summarizes the number of selected variables and the predicted $R^2$ for each method.
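The evaluation protocol described above can be sketched as follows. This is an illustration only: the seed, the spacing of the τ grid (steps of 0.05), the helper names, and the definition of predicted $R^2$ as one minus the ratio of test-set residual to total sum of squares are assumptions, since the text does not spell them out.

```python
import numpy as np


def split_indices(n, n_train, rng):
    """Randomly partition range(n) into training and test index arrays."""
    perm = rng.permutation(n)
    return perm[:n_train], perm[n_train:]


def predicted_r2(y_test, y_pred):
    """Assumed definition: 1 - SSE / SST, both computed on the test set."""
    sse = np.sum((y_test - y_pred) ** 2)
    sst = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - sse / sst


rng = np.random.default_rng(2024)

# 536 samples split into 357 training and 179 test observations.
train_idx, test_idx = split_indices(536, 357, rng)

# Hypothetical grid for tau over [0.05, 0.95]; the actual spacing is not stated.
tau_grid = np.linspace(0.05, 0.95, 19)

print(len(train_idx), len(test_idx), tau_grid[:3])
```

For each τ in the grid, the shrinkage parameter would be chosen by 10-fold cross validation on the training indices, and the selected model would then be scored with `predicted_r2` on the test indices.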
The asymLASSO approach exhibits a minor improvement in the predicted $R^2$ when compared to both the LASSO and nLASSO. Furthermore, the number of variables retained in the model (122) is similar to both LASSO (108) and nLASSO (126). The asymLASSO prefers selecting positive effects over negative effects (τ = 0.45; Figure 4). As a comparison, we also fit the LASSO with all coefficients constrained to be non-positive (non-positive LASSO). The predicted $R^2$ of this model (not reported in Table 6) is 0.36, which is substantially worse than that of asymLASSO, LASSO, and nLASSO. Thus, one can infer that the positive estimates in the model are driving the predictive performance, which is in line with what we see in Figure 4, where the cross validation error is largest for asymLASSO when τ > 0.5.
All three methods overlap in 57 of the gene expressions and nine were uniquely identified in asymLASSO (8 positive effects, 1 negative effect). Notable uniquely-identified gene expressions in this set include MND1 and JARID2, which correspond to the two largest effects in this subset. MND1 is a protein coding gene that has been shown to interact with the human oncogene GT198, which is located within the BRCA1 locus (Ijichi et al., 2000; Ko et al., 2002; Tsubouchi and Roeder, 2002; Enomoto et al., 2004, 2006). The protein coding gene JARID2 has been previously shown to be essential for the maintenance of tumor initiating cells in bladder cancer (Zhu et al., 2017) and in ovarian cancer (Cao et al., 2017). Recently, JARID2 has been shown to be widely expressed in various breast cancer cell lines, and patients with JARID2 mutation were shown to have a significantly shorter period of disease-free survival (Zhang et al., 2020).

Discussion
We develop a generalization to LASSO penalization that asymmetrically penalizes coefficients based on sign. We provide both a Bayesian and frequentist interpretation of our method. Under the Bayesian paradigm, shrinkage of the estimates is performed by placing an asymmetric Laplace prior on the regression coefficients. In doing so, the prior probability that a coefficient is less than (or greater than) zero is determined by the skew parameter τ ∈ (0, 1). Furthermore, the asymmetric Laplace prior corresponds to an asymmetric $\ell_1$ penalty for penalized regression. To better understand the behavior of asymLASSO and its relation to the LASSO, we present a closed-form solution for the OLS model under orthonormal design. Preferential shrinkage of positive or negative effect estimates can be achieved based on the value of τ. Unlike the constrained LASSO, where constraints are predetermined, asymLASSO achieves asymmetric shrinkage through the tuning parameter τ, which can be estimated using the data. We implement our approach using cyclic coordinate descent.
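To make the sign-based thresholding and the coordinate descent update concrete, the following Python sketch may help. It is not the authors' implementation. It assumes the objective $(1/2n)\lVert y - X\beta \rVert^2 + \lambda \sum_j \{\tau \max(\beta_j, 0) + (1 - \tau) \max(-\beta_j, 0)\}$, with no intercept and centered data; the scaling of the penalty used in the paper may differ. Under this assumed convention, τ < 0.5 penalizes positive coefficients less than negative ones, in line with the preference for positive effects reported at τ = 0.23. The function names are hypothetical.

```python
import numpy as np


def asym_soft_threshold(z, lam, tau):
    """Asymmetric soft-thresholding under the assumed penalty.

    Positive inputs are shrunk by lam * tau and negative inputs by
    lam * (1 - tau); anything in between is set to zero.
    """
    upper = lam * tau
    lower = lam * (1.0 - tau)
    return np.where(z > upper, z - upper, np.where(z < -lower, z + lower, 0.0))


def asym_lasso_cd(X, y, lam, tau, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for the assumed asymmetric LASSO objective."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()   # residual y - X @ beta
    col_ss = (X ** 2).sum(axis=0) / n           # x_j' x_j / n for each column
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (beta_j removed).
            z = X[:, j] @ resid / n + col_ss[j] * beta[j]
            new = float(asym_soft_threshold(z, lam, tau)) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Under an orthonormal design (X'X / n = I) the solution is the thresholded
# vector X'y / n, which gives a quick check of the coordinate descent loop.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 5)))
X = Q * np.sqrt(50)
y = rng.standard_normal(50)
print(asym_lasso_cd(X, y, lam=0.1, tau=0.25))
print(asym_soft_threshold(X.T @ y / 50, 0.1, 0.25))
```

With τ = 0.5 the two thresholds coincide at λ/2, so under this parameterization the standard soft-thresholding operator with parameter λ is recovered by passing 2λ.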
Our simulations demonstrate that asymLASSO outperforms LASSO in selecting smaller signals when effect estimates are generally in the same direction, for both low- and high-dimensional covariates, at the expense of identifying slightly more false positives. While this may seem concerning at first, neither LASSO nor asymLASSO is expected to be model selection consistent. To this end, the goal of asymLASSO is to provide a more flexible alternative to the LASSO that allows for asymmetric shrinkage, potentially allowing for the discovery of smaller effects that may have been previously missed. Additionally, in the presence of mixed-sign effects, it is difficult to prefer one approach over the other. These challenges are due to the "sign-specific shrinkage tradeoff" that is inherent to asymLASSO, which provides both advantages and disadvantages. However, in mild circumstances, we observe that the selection performance of asymLASSO can be substantially improved if we have a priori knowledge about the direction of effects and transform the design matrix accordingly. In general, while it may be difficult for practitioners to code the covariates accordingly in advance, the ability to improve selection performance by manipulating the design matrix is a unique benefit of asymmetric shrinkage when compared to the standard LASSO and non-negative LASSO.
We apply our approach to breast cancer gene expression data from the TCGA to identify genes associated with BRCA1 expression and compare its performance with LASSO. Our method identified nine genes that were not identified by either LASSO or nLASSO. Two of these genes, MND1 and JARID2, have been previously reported to be associated with BRCA1 or with breast cancer progression, which provides evidence to further understand their biological relationship within the context of BRCA1 gene expression.
We envision several paths to improve asymmetric penalization. While the motivation of the asymmetric LASSO is derived from a Bayesian perspective, parameter estimation and variable selection are performed by minimizing a penalized log-likelihood. We are currently investigating the performance of the asymmetric LASSO under a fully Bayesian framework. The asymmetric $\ell_1$ penalty does not overcome some of the theoretical and practical shortcomings that are well known for LASSO penalization. The LASSO has been shown to exhibit model selection consistency under strict conditions on the design matrix (Zhao and Yu, 2006). We conjecture that these results hold for the asymLASSO under certain assumptions on τ. Another approach to ensure model selection consistency is to extend asymmetric penalization to oracle-based procedures (Fan and Li, 2001). In Section 2.2, we show that asymLASSO can produce more biased estimates than the LASSO for certain coefficient estimates based on the value of τ. As with the remedy for the LASSO's bias in larger estimates, we can weight the shrinkage parameter for each coefficient differently and perform adaptive asymmetric LASSO penalization. We provide a graph (Figure 5) of the soft thresholding function under orthogonal design for both asymLASSO (solid black line) and adaptive asymLASSO (dashed black line), which clearly shows that adaptive asymLASSO reduces the bias for larger estimates in both directions. Proving the oracle property for asymmetric versions of, for example, adaptive LASSO (Zou, 2006), SCAD (Fan and Li, 2001), and MCP (Zhang et al., 2010) will require additional conditions on τ and λ. Lastly, when dealing with high-dimensional data, strong rules (Ghaoui et al., 2010; Tibshirani et al., 2012; Zeng et al., 2021) to safely and effectively discard a large number of inactive predictors have been implemented for computational efficiency. These rules have been well studied for penalties that are symmetric around zero, and we expect that modifications to these rules can be generally implemented for asymmetric penalization.
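The bias reduction from adaptive weighting mentioned above can be illustrated with a small sketch. This is a hypothetical construction, not the definition used for Figure 5: it assumes the sign-specific thresholds λτ (positive side) and λ(1 − τ) (negative side), and it borrows the adaptive LASSO weight $w = 1/|z|^{\gamma}$ based on the unpenalized estimate z under orthogonal design. The function name and the choice γ = 1 are assumptions.

```python
import numpy as np


def adaptive_asym_threshold(z, lam, tau, gamma=1.0):
    """Asymmetric soft-thresholding with adaptive weights w = |z|^(-gamma).

    Assumes lam > 0 and 0 < tau < 1. Setting gamma = 0 gives w = 1 and
    recovers the non-adaptive asymmetric thresholding rule.
    """
    z = np.asarray(z, dtype=float)
    with np.errstate(divide="ignore"):
        w = np.abs(z) ** (-gamma)      # infinite weight at z = 0 keeps it at zero
    upper = lam * tau * w              # threshold applied to positive inputs
    lower = lam * (1.0 - tau) * w      # threshold applied to negative inputs
    return np.where(z > upper, z - upper, np.where(z < -lower, z + lower, 0.0))


z = np.array([-5.0, -1.0, -0.3, 0.0, 0.3, 1.0, 5.0])
plain = adaptive_asym_threshold(z, lam=1.0, tau=0.25, gamma=0.0)
adaptive = adaptive_asym_threshold(z, lam=1.0, tau=0.25, gamma=1.0)

# Large inputs are shrunk far less by the adaptive rule in both directions,
# while small inputs are still set to zero.
for zi, a, b in zip(z, plain, adaptive):
    print(f"z = {zi:5.1f}   asym = {a:6.3f}   adaptive asym = {b:6.3f}")
```

Because the weight decays as |z| grows, the effective threshold vanishes for large inputs on both sides of zero, while the asymmetry governed by τ is retained for moderate inputs.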

Supplementary Material
The following supplementary materials are provided: R files necessary to reproduce the simulation results reported in this manuscript, and a PDF containing supplemental tables and figures and the proof of Lemma 2.1.