Do Predictor Envelopes Really Reduce Dimension?

Predictor envelopes model the response variable by using a subspace of dimension d extracted from the full space of all p input variables. Predictor envelopes have a close connection to partial least squares and enjoy improved estimation efficiency in theory. As such, predictor envelopes have become increasingly popular in Chemometrics. Often, d is much smaller than p, which seemingly enhances the interpretability of the envelope model. However, the process of estimating the envelope subspace adds complexity to the final fitted model. To better understand the complexity of predictor envelopes, we study their effective degrees of freedom (EDF) in a variety of settings. We find that in many cases a d-dimensional predictor envelope model can have far more than d + 1 EDF and often has close to p + 1. However, the EDF of a predictor envelope depend heavily on the structure of the underlying data-generating model and there are settings under which predictor envelopes can have substantially reduced model complexity.


Introduction
first introduced predictor envelopes as a method of sufficient dimension reduction for "large p" settings, where p denotes the number of input variables. Cook, Helland, and Su (2013) further developed the predictor envelope model, revealing its connection to partial least squares (PLS) and proving the efficiency gain of predictor envelopes over PLS. The predictor envelope model has become increasingly popular in Chemometrics-see Cook and Forzani (2020) which was published as the cover story of the Journal of Chemometrics. A substantial amount of work has been devoted to envelope models. We refer readers to Cook (2018) for a comprehensive treatment of the subject. The predictor envelope model uses a subspace of dimension d extracted from the full space of p input variables, drawing on the response variable to find this d-dimensional subspace which is referred to as the envelope. Compared with the full model using all p variables, a predictor envelope model can achieve substantial efficiency gains. It is intuitive for users to interpret a predictor envelope with a d-dimensional subspace as having complexity d + 1 (the extra one corresponding to the intercept term in the model), but this could be misleading. For instance, best k-subset regression finds a subset regression model with k input variables that has the smallest residual sum of squares. The model complexity of best k-subset regression can be far greater than k because those k "best" input variables are chosen based on the data and the response variable is involved in the selection-see the simulation results in Janson, Fithian, and Hastie (2015). The same argument applies to predictor envelopes. If we knew the envelope subspace then the model complexity would be equal to that of fitting a linear model with d features. In reality, we have to estimate the envelope subspace, so the actual model complexity can be different. To rigorously quantify this phenomenon, we use a more general measure of model complexity: the effective degrees of freedom (EDF) developed by Efron (1986) based on Stein's unbiased risk estimation theory. Critically, the EDF can be used to correct the downward bias of the training error as an estimator of in-sample prediction error. As such, accurate estimates of the EDF can prove invaluable to researchers seeking to compare and assess models' prediction performance.
In this paper, we explore the complexity of predictor envelopes as measured by their EDF. Due to the relatively complex nature of the predictor envelope estimator (See Section 2.2 for details), the analytical expression of its EDF is difficult to obtain. We opt to conduct a Monte Carlo study of the EDF under various model settings. Since linear models are the foundation for predictor envelopes, we limit ourselves to linear models: Y = Xβ + where Y ∈ R is the response variable, X ∈ R p represents the set of covariates, β is the true regression coefficient and is the error term. We find that the number of predictors p and the dimension d of the envelope do not entirely determine an envelope's EDF. Rather, the structure of the predictors X and their relationship with the response Y influence the range of values that the EDF can take. On the one hand, in many but not all of our experiments, predictor envelopes are as complex as saturated linear models fit to the same p predictors, with p + 1 EDF. On the other hand, we identify two scenarios in which predictor envelopes indeed have much smaller EDF compared with the saturated linear model: (1) when the predictors that explain the variability in Y are highly correlated and (2) when the covariance matrix of X has a few dominating eigenvalues and the true data-generating β is linearly dependent on at least one of their corresponding eigenvectors. The EDF reductions in the first scenario are non-trivial, but fairly modest. The EDF reductions in the second scenario are far larger: a predictor envelope with the right settings can have EDF near d + 1, which suggests that the intuitive answer is correct in this scenario.
In Section 2 we briefly review the effective degrees of freedom and predictor envelopes. In Section 3 we discuss the ordinary Monte Carlo method for numerically computing the EDF as the design of our study at a high level. More concrete simulation models are described in Sections 4 and 5, where some "negative" and "positive" messages are reported, respectively. In Section 6 we confirm the same findings by using input variables from a real data example.

Effective Degrees of Freedom
Suppose we fit a model to n observations where Y i = f (X i ) + i and the i have mean 0 and variance σ 2 , and that this model gives us estimates Y i . Ideally, we would like our model to make accurate predictions on new data. A model with a smaller training error may not have a smaller in-sample predictor error. It turns out that there is a model complexity term to the training error that can help correct the bias. The following theorem from Efron (1986) provides a foundation for doing so.
Theorem 1 (Efron, 1986). Let Y i = f (X i ) + i where the i are independent identically distributed (i.i.d.) with mean 0 and variance σ 2 for i = 1, . . . , n. Suppose that Y i are estimates of f (X i ) from a model fit using the observed Y i . Lastly, letỸ i be new observations with the same covariate values X i . Then Define the effective degrees of freedom for any model-fitting method FIT γ with tuning parameter γ as follows: Definition 1 (Effective Degrees of Freedom). Let X ∈ R p and Y = f (X) + , where E( ) = 0 and Var( ) = σ 2 . Suppose we have n observations (Y i , X i ) and that we fit a model using FIT γ . We define the effective degrees of freedom (EDF) for this model to be An immediate corollary to Theorem 1 is that Thus we see that 2 df(X, f, FIT γ ) plays the same role as 2 df in Mallows' C p statistic (Mallows, 1973) for linear regression. In this way, the EDF provide a meaningful measure of model complexity and serve as a generalization of the degrees of freedom for a linear regression model. When FIT γ is a linear smoother, meaning Y = SY for some smoothing matrix S that does not depend on Y, we have a closed-form expression for the EDF: df(X, f, FIT γ ) = tr(S). For more sophisticated models, the analytical expression of the EDF can be difficult to obtain.

Predictor Envelopes
In this section we briefly review the predictor envelope estimation procedure introduced by Cook, Helland, and Su (2013) and further developed by Cook, Forzani, and Su (2016). Suppose we have a linear model In this setting, X is a random vector with E[X] = μ x and Var(X) = X , E[ ] = 0 and Var( ) = σ 2 , and is independent of X denoted by ⊥ ⊥ X. Before we delve into the details of how envelopes achieve predictor reduction, we need to establish some notation. Suppose we have a matrix M ∈ R p×p and a subspace S ⊆ R p . We define the M-inner product as s, t M = s T Mt. We let P S(M) denote the projection onto S in the M-inner product and let Q S(M) = I − P S(M) . Note that when projecting onto a subspace in the usual inner product, we drop M from our notation. Lastly, let S ⊥ denote the orthogonal complement of S in the usual inner product. Suppose that we find a subspace S ⊆ R p such that (i) corr(Q S X, P S X) = 0 and (ii) corr(Y, Q S X|P S X) = 0. Conditions (i) and (ii) tell us that P S X provides all of the information about Y that we can extract linearly from X while Q S X does not provide us with any additional information about either Y or P S X. Cook, Helland, and Su (2013) proved that when (1) is the true model, conditions (i) and (ii) are equivalent to (a) X S ⊆ S and X S ⊥ ⊆ S ⊥ and (b) span(β) ⊆ S, respectively. The smallest subspace S satisfying these conditions is called the X -envelope of B = span(β) and is denoted by E X (B). The term "envelope" comes from the fact that E X (B) "envelopes" the column space of β in satisfying condition (b).
Let d = dim(E X (B)), where d < p, and ∈ R p×d be a semiorthogonal basis matrix for E X (B). If we know , then we can express (1) as and estimate μ ∈ R and α ∈ R d as in standard linear regression with the reduced predictors T (X − μ x ). In practice, however, we need to estimate the basis matrix . Let C = (X T , Y T ) T and let S C denote the sample version of C = Var(C). Cook, Helland, and Su (2013) used a likelihood-based approach to estimate C , minimizing the objective function The structure of E X (B) allows us to express C as a function of and a few other parameters. Let S X denote the sample covariance matrix of X and s 2 Y denote the sample of variance Y . Define Z = s −1 Y Y and let S XZ denote the sample covariance between X and Z. After minimizing F d (S c , C ) over every parameter except , we only need to minimize giving us the estimatorˆ = arg min For this study, we use the R package Renvlp (Lee and Su, 2020), which findsˆ using the algorithm outlined by Cook, Forzani, and Su (2016). In practice we do not know d = dim(E X (B)) and therefore treat d as a modeling parameter. Throughout this paper, we use Env d to denote the predictor envelope model fitting procedure with the envelope dimension set to d.
The analytical expression of the EDF for a predictor envelope (df(X, f, Env d )) is difficult to obtain. Hence, in this study we use Monte Carlo to numerically compute the EDF of predictor envelopes in order to gain some insight into their complexity in different settings. We anticipate that df(X, f, Env d ) will vary considerably with d. If we set d = p, then the predictor envelope has no reduction and df(X, f, Env p ) = p + 1. In actual applications of using predictor envelopes, d is often a small number (d p). In such settings, we expect that df(X, f, Env d ) will fall between d + 1 and p + 1. If we knew beforehand and fit a linear model to T X, that linear model would have d + 1 degrees of freedom. As such, we can think about decomposing df(X, f, Env d ) into the d + 1 degrees of freedom used to fit a linear model toˆ T X and the additional degrees of freedom used in estimating the envelope.

Computing the EDF via Monte Carlo
We use Monte Carlo to numerically compute df(X, f, FIT γ ) for a given design matrix X ∈ R n×p , data-generating model f , and model-fitting method FIT γ . This is the standard approach for studying the EDF of a model when an analytical expression is not available (see, also, Janson, Fithian, and Hastie (2015)). We create M datasets with X, f , and a pre-specified error variance Var( ) = σ 2 as follows: for each j ∈ {1, . . . , M} we generate ij Throughout this study we set M = 400 and σ 2 = 1.
We use different design matrices X across our simulations. Though we limit ourselves to linear data-generating models, f (X) = Xβ, which is the foundation for predictor envelopes, we vary β across simulations as well. We want to understand how the structure of X and f impact the EDF rather than simply find the EDF for specific X and β. As such, we run each case 100 times, using the same simulation settings to generate different design matrices X k and different coefficient vectors β k for each of the 100 iterations. This gives us 100 different estimates df(X k , β k , FIT γ ) from design matrices and true coefficient vectors generated using the same settings. Throughout Sections 4, 5, and 6 we provide the mean and 1-standard deviation bars for the EDF for each set of simulation settings. To emphasize the role of the underlying structure of the data in shaping the EDF we report the mean of the estimates for each combination of settings for X and β: df(X, β, FIT γ ) = 1 100 100 k=1 df(X k , β k , FIT γ ).

Envelope Complexity in Different Cases
In this section, we examine three cases to find how the structure of the joint distribution of (X, Y ) and the envelope dimension impact df(X, f, Env d ): Case 1. X has a compound symmetric covariance matrix. Case 2. X has a block covariance matrix with 5 predictors in each block. Case 3. X has an order-1 auto-regressive (AR1) covariance matrix. Let ρ = Cov(X). We can express the first three correlation structures as follows: We vary parameter ρ from 0 to 0.8 in each case. In all three cases, we generate i.i.d. X i ∼ N(0, ρ ) and i ∼ N(0, 1) with p = 50 and n = 150. In the compound symmetric and AR1 cases, we generate β j ∼ Gamma(2, 2) for 5 randomly selected j ∈ {1, . . . , p} and set the 45 remaining β j to 0. In the block covariance case, we generate β j ∼ Gamma(2, 2) for j ∈ {1, . . . , 5}, so that the significant predictors all fall in the same block, and set the 45 remaining β j to 0. In all three cases, we fit predictor envelopes with envelope dimensions ranging from d = 1 to 10.  Figure 1 shows EDFs for predictor envelopes in the three cases outlined above. These plots include 1-standard deviation bars to show the variation across Monte Carlo iterations with different β k . We see modest EDF reductions when d = 1, X has either a compound-symmetric or block covariance structure, and the informative predictors are highly correlated. All three of these conditions appear to be necessary here, as the envelopes' EDF reductions quickly disappear when either the informative predictors are not as strongly correlated or d > 1. Moreover, we can see that there is a considerable amount of variation in the EDF in the cases where we see some reduction, indicating that β k also affects df(X k , β k , Env d ). It is precisely because of this variation that we choose to report the average EDF across X k and β k generated with the same settings: if we picked a single X and β to use across all 100 iterations, we might not be able to generalize beyond that case.
To ensure that these results are not limited to one choice of n and p, we rerun the compound symmetric case with n = 200 and p = 100. Figure 2 shows that the EDFs for predictor envelopes in this case follow the same pattern as when n = 150 and p = 50.
While there are a couple of scenarios in which EDF reductions are worth highlighting, we should not overlook that df(X, β, Env d ) is approximately p + 1 in most of the settings. In the AR1 simulations, df(X, β, Env d ) ≈ p+1 for all values of d, with little variation across simulations with different β k . This suggests that the additional degrees of freedom used in estimating the envelope subspace E X (B) are nearly p − d in these cases. Returning to Definition 1, this tells us that the algorithm used to estimate E X (B) is relying heavily on information from the response, as the fitted values Y i and the observed Y i are highly correlated. In the compound symmetric and block correlation cases, on the other hand, the algorithm relies less on information from the response and more on information from the correlation structure of the predictors, leading to lower EDFs.

Settings for Reducing Envelope Complexity
Given the relatively "negative" message from Section 4, it is worth exploring whether there are scenarios in which a fitted predictor envelope model truly has substantially reduced EDF. In this section we consider the following set of three conditions: (1) a few eigenvalues of Cov(X) = X dominate its remaining eigenvalues, (2) the envelope dimension d is no bigger than the number of eigenvectors with dominating eigenvalues, and (3) the true data-generating β is linearly related to a least one of the eigenvectors with a dominating eigenvalue. We recall from Section 2.2 that E X (B) is the smallest subspace S such that (a) X S ⊆ S and X S ⊥ ⊆ S ⊥ and (b) span(β) ⊆ S. A subspace A generated by orthogonal eigenvectors of X satisfies (a) (Cook, Helland, and Su, 2013). If β falls in the space generated by those eigenvectors, then (b) is satisfied as well. In this case, the envelope subspace E X (B) falls within A and the true envelope dimension d * is equal to the number of generating eigenvectors. As such, we anticipate that the EDF of predictor envelopes can drop to near d + 1 when all three of our conditions are satisfied.
We pick these five cases and the settings we vary within them to highlight a few meaningful contrasts. Let V denote the set of eigenvectors of X with dominating eigenvalues and let W denote the set of eigenvectors of X on which β is linearly dependent. With this notation, we can express the conditions under which we see substantial EDF reductions succinctly: (1) V = ∅, (2) d |V |, and (3) V ∩ W = ∅. We can verify the importance of the first two conditions by focusing on how df(X, β, Env d ) varies with λ 1 and d within each of the first four cases, as λ 1 and d determine whether V = ∅ and d |V |. We must compare results across cases to establish the importance of the third condition, as only Case 5 is designed so that V ∩ W = ∅. Figure 3 shows the simulation results for the five cases outlined above. We start by focusing on the first four cases. A clear trend emerges among the simulations for which λ 1 = 100: df(X, β, Env d ) is near d + 1 when d |V | and near p + 1 for d > |V |. There is a clear fault line between d = |V | and d = |V | + 1 in all four of these cases. We see that df(X, β, Env d ) is constant for d |V | in both Case 2 and Case 3 and that df(X, β, Env d ) increases slowly in d past d = |V | + 1. We see similar patterns for λ 1 = 5, 10, and 25, though the jumps in the EDF between d = |V | and d = |V | + 1 are less dramatic.
Intuitively, we might also expect that the number of eigenvectors of X which provide us with information about β would similarly play a major role in determining the EDF for a predictor envelope. In Cases 3 and 4, however, we do not see major changes in df(X, β, Env d ) between d = |W | and d = |W |+1. Thus |V |, rather than |W |, appears to be the salient threshold for the envelope dimension when it comes to EDF reduction.
Next, we focus on making comparisons across different λ 1 for fixed d. We see that the relative size of λ 1 determines how far df(X, β, Env d ) drops below p + 1 for d ∈ {1, . . . , |V |}. That is, the difference between df(X, β, Env |V | ) and df(X, β, Env |V |+1 ) grows as the gap between λ 1 and λ |V |+1 increases. In fact, when λ 1 = 1 predictor envelopes appear to hardly deliver any EDF reductions at all. Thus it seems necessary that V = ∅ for predictor envelopes to achieve substantial EDF reductions.
Lastly, we focus on the role of the eigenvectors of X which provide us with information about β. We set V ∩ W = ∅ in Case 5. Here we see that even when d = |V | = 1, df(X, β, Env d ) is only slightly below p + 1. Contrast this result with Case 1. In both cases |V | = |W | = 1, yet we see dramatically different behavior when d = 1. This comparison suggests that at least one of the eigenvectors in V needs to provide some information about β for envelopes to achieve substantial EDF reductions.
As in the previous section, we rerun one of our simulations to ensure our findings generalize beyond our initial choice of n and p. Figure 4 shows the EDFs for predictor envelopes when n = 200, p = 100, and X and Y are generated as in Case 1. The overall pattern remains the same.

A Real Data Example
In this section, we turn to a real data example to study the EDF of predictor envelopes. Our data set characterizes the student populations of Minneapolis district schools in 1972 and contains n = 63 observations and p = 9 predictors (Cook, 1998).
In our first set of simulations, we only use the design matrix from this data set. We simulate our response variable from a linear model as described in Section 3, using different coefficient vectors β k to generate Y in each of 100 simulation iterations. In these simulations, we generate β 1 , . . . , β 5 ∼ Gamma(2, 2) and set β 6 = · · · = β 9 = 0. For our second set of simulations, we first fit an envelope model with dimension 1 to the complete data, then use the estimated regression coefficientsβ and response varianceσ 2 from that model as the "true" data-generating values for Monte Carlo simulations. In doing so, we treat the fitted envelope model as the "true" model for these simulations. In both sets of simulations, we fit predictor envelopes with envelope dimensions from d = 1 to 5. Figure 5 shows the results from the first set of simulations. We see that df(X, β, Env d ) is about 10 regardless of the envelope dimension. These results align with our findings from Section 4 for settings in which the significant predictors are not highly correlated. In the results for the second set of simulations, we see from Figure 6 that df(X, β, Env d ) ≈ 2 for a predictor envelope with dimension 1, mirroring the results from the ideal setting simulations in Section 5. As in the ideal simulations, the EDF remains below p + 1 for models with higher envelope dimensions as well. These results reveal that when the "true" model for (X, Y ) has an envelope structure (as it does in this case because we generated Y from such a model), estimating the envelope adds few degrees of freedom to the overall fitting procedure.

Discussion
We have found that while the fundamental idea of predictor envelopes is to reduce the full space of the original p input variables down to a much-reduced subspace T X ∈ R d for the purpose of modeling Y , the actual estimation process tends to have a dramatic impact on the complexity of the fitted predictor envelope model. In many reasonable settings predictor envelopes tend to have far more than d + 1 effective degrees of freedom. In fact, the EDF is often close to p + 1. This suggests that under those settings estimating the subspace T X adds nearly p − d degrees of freedom to the overall fitting procedure. By comparing the EDF across several simulation settings, we have found that the structure of the joint distribution of (X, Y ) plays a critically important role in the EDF.
The scenarios identified in Section 5 gave us results close to the usual perception that the complexity is d + 1, suggesting that the process of estimating E X (B) adds few degrees of freedom in these settings. In particular, we found that the EDF is close to d + 1 when the "true" model is a predictor envelope and d d * , the true envelope dimension. When we set an envelope model with dimension 1 as the "true" model for simulations in Section 6 we saw similar results: predictor envelopes with d = 1 used only 2 degrees of freedom. In this regard, predictor envelopes follow a pattern seen in other dimension reduction techniques. Mukherjee et al. (2015) identified a similar phenomenon for reduced-rank estimators: their unbiased estimator for the EDF approaches the "naïve estimator," the number of free parameters in the model, when the model rank r is close to the true underlying rank r * .
We know from Theorem 1 that the EDF measures how much the training RSS differs from the in-sample test RSS. As we have found, EDFs for predictor envelopes (and therefore the corrections needed for the training RSS) are often much larger than one might assume. We compute the EDF, RSS and in-sample test RSS for predictor envelopes in a few of our simulation studies. Figures 7 and 8 plot these values for the compound symmetric case with ρ = 0.8 (from Section 4) and the first reduction case with λ 1 = 25 (from Section 5), respectively. In both cases the RSS provides a gross underestimate of the in-sample test RSS. Moreover, in the latter case we see that different values of d minimize the RSS and the in-sample test RSS. The large EDF can be interpreted as the high complexity of the fitted envelope model, which implies high  estimation variance contributing to the high test RSS. These results underscore that users need accurate estimates of the EDF to correctly compare and assess predictor envelopes.
Lastly, we point out that the EDF of partial least squares (PLS) was examined in Krämer and Sugiyama (2011) where it was observed that the EDF can be far greater than d + 1 (here d denotes the number of components used in PLS) and often can be close to p + 1. As mentioned earlier, predictor envelope models and PLS are related in the sense that they share the same population target, though the former uses a likelihood approach for estimation and has improved efficiency over the latter. Our findings that the correlation structure of the design matrix shapes the EDF of predictor envelopes align with the findings in Krämer and Sugiyama (2011).

Supplementary Material
Code and data for reproducing our results can be found at https://github.com/TateJacobson/ Envelope-EDF. This repository contains the following folders: • Cleaning Output: Contains an R script for cleaning saved simulation output and generating plots from it. • edf: An R package for computing the effective degrees of freedom • Simulations: Contains R scripts for the simulations run in "Do Predictor Envelopes Really Reduce Dimension?"