Weighted Orthogonal Components Regression Analysis

In the multiple linear regression setting, we propose a general framework, termed weighted orthogonal components regression (WOCR), which encompasses many known methods as special cases, including ridge regression and principal components regression. WOCR makes use of the monotonicity inherent in orthogonal components to parameterize the weight function. The formulation allows for efficient determination of tuning parameters and hence is computationally advantageous. Moreover, WOCR offers insights for deriving new better variants. Specifically, we advocate weighting components based on their correlations with the response, which leads to enhanced predictive performance. Both simulated studies and real data examples are provided to assess and illustrate the advantages of the proposed methods.


Introduction
Consider the typical multiple linear regression setting where the available data $\mathcal{L} := \{(y_i, x_i) : i = 1, \ldots, n\}$ consist of $n$ i.i.d. copies of the continuous response $y$ and the predictor vector $x \in \mathbb{R}^p$. Without loss of generality (WLOG), we assume that the $y_i$'s are centered and the $x_{ij}$'s are standardized throughout the article. Thus the intercept term is presumed to be 0 in linear models, for which the general form is given by $y = X\beta + \varepsilon$ with $y = (y_i)$ and $\varepsilon \sim (0, \sigma^2 I_n)$. For the sake of convenience, we sometimes omit the subscript $i$. When the $n \times p$ design matrix $X$ is of full column rank $p$, the ordinary least squares (OLS) estimator $\hat{\beta} = (X^T X)^{-1} X^T y$, as well as its corresponding predicted value $\hat{y}(x) = x^T \hat{\beta}$ at a new observation $x$, enjoys many attractive properties.
However, OLS becomes problematic when $X$ is rank-deficient, in which case the Gram matrix $X^T X$ is singular. This may happen either because of multicollinearity when the predictors are highly correlated or because of high dimensionality when $p \gg n$. A wealth of proposals have been made to combat the problem. Among others, we are particularly concerned with a group of techniques that includes ridge regression (RR; Hoerl and Kennard, 1970), principal components regression (PCR; Massy, 1965), partial least squares regression (PLSR; Wold, 1966, 1978), and continuum regression (CR; Stone and Brooks, 1990). One common feature of these approaches is that they first extract orthogonal or uncorrelated components that are linear combinations of the columns of $X$ and then regress the response directly on the orthogonal components. The number of orthogonal components does not exceed $n$ or $p$, hence reducing the dimensionality. This is the key to how these types of methods approach high-dimensional or multicollinear data.
In this article, we first introduce a general framework, termed weighted orthogonal components regression (WOCR), which puts the aforementioned methods into a unified class. Compared to the original predictors in X, there is a natural ordering in the orthogonal components. This information allows us to parameterize the weight function in WOCR with low-dimensional parameters, which are essentially the tuning parameters, and estimate the tuning parameters via optimization. The WOCR formulation also facilitates convenient comparison of the available methods and suggests their new natural variants by introducing more intuitive weight functions.
We shall restrict our attention to PCR and RR models. The remainder of the article is organized as follows. In Section 2, we introduce the general framework of WOCR. Section 3 exemplifies the applications of WOCR with RR and PCR. More specifically, we demonstrate how WOCR formulation can be used to estimate the tuning parameter in RR and select the number of principal components in PCR, and then introduce their better variants on the basis of WOCR. Section 4 presents numerical results from simulated studies that are designed to illustrate and assess WOCR and make comparisons with others. We also provide real data illustrations in Section 5. Section 6 concludes with a brief discussion, including the implication of WOCR on PLSR and CR models.

Weighted Orthogonal Components Regression (WOCR)
Denote $m = \mathrm{rank}(X)$ so that $m \le (p \wedge n)$. Let $\{u_1, \ldots, u_m\}$ be the orthogonal components extracted in some principled way, satisfying $u_j^T u_{j'} = 0$ if $j \ne j'$ and 1 otherwise. Here $\{u_j\}_{j=1}^m$ forms an orthonormal basis of the column space of $X$, $C(X) = \{Xa : a \in \mathbb{R}^p\}$. Since $u_j \in C(X)$, suppose $u_j = Xa_j$ for $j = 1, \ldots, m$. The condition $u_j^T u_{j'} = 0$ implies that $a_j^T X^T X a_{j'} = 0$, i.e., the vectors $a_j$ and $a_{j'}$ are $X^T X$-orthogonal. This in turn implies that $a_j$ and $a_{j'}$ are orthogonal if, furthermore, $a_j$ or $a_{j'}$ is an eigenvector of $X^T X$ associated with a non-zero eigenvalue. In matrix form, let $U_{n \times m} = [u_1, \ldots, u_m]$, $W_{m \times m} = \mathrm{diag}(w_j)$, and $A_{p \times m} = [a_1, \ldots, a_m]$. We have $U = XA$ with $U^T U = I_m$, but it is not necessarily true that $UU^T = I_n$. The construction of the matrix $A$ may be independent of the response $y$ (e.g., in RR and PCR) or may depend on it (e.g., in PLSR and CR); our discussion will be restricted to the former scenario. It is worth noting that extracting $m$ components reduces the original $n \times p$ problem to an $n \times m$ problem (with $m \le n$), hence achieving automatic dimension reduction.
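As a minimal sketch of this setup, the following NumPy snippet takes the components from the thin SVD of $X$, which is one possible construction and is assumed here purely for illustration. The function name `orthonormal_components` and the rank tolerance are hypothetical choices, not part of the method itself.

```python
import numpy as np

def orthonormal_components(X, tol=1e-10):
    """Return (U, d, A) with U = X @ A and U.T @ U = I_m, using the thin SVD."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    m = int(np.sum(d > tol * d[0]))      # numerical rank of X
    U, d, V = U[:, :m], d[:m], Vt[:m].T
    A = V / d                            # A = V D^{-1}, so that X A = U
    return U, d, A

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))         # toy design with n = 20, p = 5
U, d, A = orthonormal_components(X)
gram_small = U.T @ U                     # m x m identity
gram_big = U @ U.T                       # n x n projection, not the identity when m < n
```

Here `gram_small` is the identity while `gram_big` is only a projection onto $C(X)$, which is the distinction drawn in the text.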

Model Specification
The general form of a WOCR model can be conveniently expressed in terms of the fitted vector
$$\tilde{y} = \sum_{j=1}^{m} w_j \gamma_j u_j, \tag{1}$$
where $\gamma_j = \langle y, u_j \rangle$ is the regression coefficient and $0 \le w_j \le 1$ is the weight for the $j$-th orthogonal component $u_j$. In matrix form, (1) becomes
$$\tilde{y} = U W U^T y. \tag{2}$$
We will see that RR, PCR, and many others are all special cases of the above WOCR specification, with different choices of $\{u_j, w_j\}$. For example, if $w_j = 1$ for all $j$, i.e., $W = I_m$, then (1) amounts to least squares fitting since $C(U) = C(X)$. This WOCR formulation allows us to conveniently study its general properties. It follows immediately from (2) that the associated hat matrix $H$ is
$$H = U W U^T. \tag{3}$$
The resultant sum of squared errors (SSE) is given by $\mathrm{SSE} = \|\tilde{y} - y\|^2 = y^T (I_n - H)^2 y$. Note that $H$ is not an idempotent or projection matrix in general, and neither is $I_n - H$. Instead, since $U^T U = I_m$,
$$(I_n - H)^2 = I_n - U (2W - W^2) U^T. \tag{4}$$
From (2) and $U = XA$, the WOCR estimate of $\beta$ is
$$\tilde{\beta} = A W U^T y. \tag{5}$$
It follows that, given a new data matrix $X'$, the predicted vector is
$$\tilde{y}' = X' \tilde{\beta} = X' A W U^T y. \tag{6}$$
Although not further pursued here, many other quantities and properties of WOCR can be derived accordingly with the generic form, including $E\|\tilde{\beta} - \beta\|^2$ as studied in Hoerl and Kennard (1970) and Hwang and Nettleton (2003).
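The quantities in this subsection can be computed in a few lines. The sketch below assumes components obtained from the thin SVD of a toy design (so $A = VD^{-1}$); the function name `wocr_fit` and the particular weights are hypothetical and serve only to illustrate the fitted vector, the coefficient estimate, and the SSE.

```python
import numpy as np

def wocr_fit(U, A, w, y):
    """Return fitted vector, coefficient estimate, and SSE for weights w."""
    gamma = U.T @ y                      # gamma_j = <y, u_j>
    y_fit = U @ (w * gamma)              # U W U^T y
    beta = A @ (w * gamma)               # A W U^T y
    sse = float(np.sum((y - y_fit) ** 2))
    return y_fit, beta, sse

# Toy data (hypothetical); components taken from the thin SVD of X
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(30)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt.T / d                             # so that U = X A

# Unit weights reproduce the least squares fit
y_ols, beta_ols, sse_ols = wocr_fit(U, A, np.ones(4), y)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Shrinking weights: H = U W U^T is no longer idempotent
w = np.array([1.0, 0.8, 0.5, 0.1])
y_w, beta_w, sse_w = wocr_fit(U, A, w, y)
H = U @ np.diag(w) @ U.T
```

With unit weights, `beta_ols` agrees with `beta_lstsq`; with the shrinking weights, `H @ H` differs from `H`, so the SSE has to be computed with the squared form $(I_n - H)^2$.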

Parameterizing the Weights
The next important component in specifying WOCR is to parameterize the weights in W in a principled way. The key motivation stems from the observation that, compared to the original regressors in X, the orthogonal components in U are naturally ordered in terms of some measure. This ordering may be attributed to some specific variation that each u j is intended to account for. Another natural ordering is based on the coefficients {|γ j |} m j=1 . Because of orthogonality, the regression coefficient γ j remains the same for u j in both the simple regression and multiple regression settings.
This motivates us to parameterize the weights $w_j$ based on the ordering measure. It is intuitive to assign more weight to more important components. To do so, $w_j$ can be specified as a function that is monotone in the ordering measure and parameterized with a low-dimensional vector $\lambda$. Two such examples are given in Figure 1. Among many other choices, the use of sigmoid functions is advocated in this article because they provide a smooth approximation to the 0-1 threshold indicator function, which is useful for component selection, and because they are flexible enough to be adjusted for improved prediction accuracy. In general, we denote $w_j = w_j(\lambda)$. The vector $\lambda$ in the weight function essentially consists of the tuning parameters. This parameterization conveniently expands the conventional modeling methods by providing several natural WOCR variants that are more attractive, as illustrated in the next section. Determining the tuning parameters $\lambda$ is yet another daunting task. In common practice, one fits the model for a number of fixed tuning parameter values and then resorts to cross-validation or a model selection criterion to compare the model fits. This can be computationally intensive, especially with big data. When a model selection criterion is used, WOCR provides a computationally efficient way of determining the tuning parameter $\lambda$. The key idea is to plug the specification (1) into a model selection criterion and optimize with respect to $\lambda$. Depending on the scenario, commonly used model selection criteria include the Akaike information criterion (AIC; Akaike, 1974), generalized cross-validation (GCV; Golub, Heath, and Wahba, 1979), and the Bayesian information criterion (BIC; Schwarz, 1978). These model selection criteria essentially involve the SSE and the degrees of freedom (DF). A general form of SSE is given by (4).
For DF, we follow the generalized definition by Efron (2004):
$$\mathrm{DF} = E\left\{ \mathrm{tr}\left( \frac{d\tilde{y}}{dy} \right) \right\}. \tag{7}$$
If neither the components $U$ nor the weights $w_j$ depend on $y$, then DF, often termed the effective degrees of freedom (EDF) in this scenario, is computed as
$$\mathrm{EDF} = \mathrm{tr}(H) = \mathrm{tr}(U W U^T) = \mathrm{tr}(W) = \sum_{j=1}^{m} w_j. \tag{8}$$
When either the components $U$ or the weights $w_j$ depend on $y$, the computation of DF is more difficult and will be treated on a case-by-case basis. The specific forms of GCV, AIC, and BIC can be obtained accordingly. We treat the model selection criterion as an objective function of $\lambda$. The best tuning parameter $\hat{\lambda}$ can then be estimated by optimization. Since $\lambda$ is of low dimension, the optimization can be solved efficiently. This saves computational cost in selecting the tuning parameter.

WOCR Examples
We show how several conventional models relate to WOCR with different weight specifications and different ways of constructing the orthogonal components U = XA and then how the WOCR formulation can help improve and expand them. In this section, we first discuss how WOCR helps determine the optimal tuning parameter λ in ridge regression and make inference accordingly. Next, we show that WOCR facilitates an efficient computational method for selecting the number of components in PCR. The key idea is to approximate the 0-1 threshold function with a smooth sigmoid weight function. Several natural variants of RR and PCR that are advantageous in predictive modeling are then derived within the WOCR framework.

Pre-Tuned Ridge Regression
Ridge regression (Hoerl and Kennard, 1970) can be formulated as a penalized least squares optimization problem $\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|^2$, with tuning parameter $\lambda$. The solution yields the ridge estimator $\hat{\beta}_R = (X^T X + \lambda I_p)^{-1} X^T y$.
The singular value decomposition (SVD) of the data matrix $X$ offers a useful insight into RR (see, e.g., Hastie, Tibshirani, and Friedman, 2009). Suppose that the SVD of $X$ is given by
$$X = U D V^T, \tag{9}$$
where both $U = [u_1, \ldots, u_m] \in \mathbb{R}^{n \times m}$ and $V = [v_1, \ldots, v_m] \in \mathbb{R}^{p \times m}$ have orthonormal column vectors that form an orthonormal basis for the column space $C(X)$ and the row space $C(X^T)$ of $X$, respectively, and the matrix $D = \mathrm{diag}(d_j)$ has singular values satisfying $d_1 \ge d_2 \ge \cdots \ge d_m > 0$.
Noticing that $X^T X = V D^2 V^T$, the column vectors of $V$ yield the principal directions. Since $X v_j = d_j u_j$, it can be seen that $u_j$ is the $j$-th normalized principal component. The fitted vector in RR conforms well to the general form (1) of WOCR, as established by the following proposition.
Proposition 3.1. Regardless of the magnitude of $\{n, p, m\}$, the fitted vector $\hat{y}_R = X \hat{\beta}_R$ in ridge regression can be written as
$$\hat{y}_R = \sum_{j=1}^{m} \frac{d_j^2}{d_j^2 + \lambda}\, \gamma_j u_j, \quad \text{with } \gamma_j = \langle y, u_j \rangle. \tag{10}$$
Proof. The proof when $m = p$ (i.e., $p < n$ and hence $V^{-1} = V^T$) can be found in, e.g., Hastie, Tibshirani, and Friedman (2009). We consider the general case including the $p \gg n$ scenario. With the general SVD form (9) of $X$, we have $U^T U = V^T V = I_m$, but it is not necessarily true that $UU^T = I_n$ or that $VV^T = I_p$.
First, plugging the SVD of $X$ into $\hat{y}_R$ yields
$$\hat{y}_R = X (X^T X + \lambda I_p)^{-1} X^T y = U D V^T (V D^2 V^T + \lambda I_p)^{-1} V D U^T y. \tag{11}$$
Define $W = \mathrm{diag}(w_j)$ with $w_j = d_j^2 / (d_j^2 + \lambda)$. Since $(V D^2 V^T + \lambda I_p) V = V (D^2 + \lambda I_m)$, we have $(V D^2 V^T + \lambda I_p)^{-1} V = V (D^2 + \lambda I_m)^{-1}$. Then it can be easily checked that $\hat{y}_R$ in (11) can be rewritten as
$$\hat{y}_R = U D^2 (D^2 + \lambda I_m)^{-1} U^T y = U W U^T y = \sum_{j=1}^{m} w_j \gamma_j u_j,$$
which completes the proof.
One natural ordering of the principal components $u_j$ is based on their associated singular values $d_j$. Hence, the weight function $w_j = d_j^2/(d_j^2 + \lambda)$ is monotone in $d_j$ and parameterized with one single parameter $\lambda$. See Figure 1(a) for a graphical illustration of this weight function. In view of $XV = UD$, the matrix $A$ in WOCR is given by $A = V D^{-1}$.
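The proposition is easy to check numerically. The sketch below, on hypothetical toy data with $p > n$ and an arbitrary value of $\lambda$, compares the direct ridge fit with the weighted-components form; it is an illustration rather than part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 15, 40, 2.5                  # p > n, so X^T X is singular
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Direct ridge fit: X (X^T X + lam I_p)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_direct = X @ beta_ridge

# WOCR form: sum_j d_j^2 / (d_j^2 + lam) * gamma_j * u_j
U, d, Vt = np.linalg.svd(X, full_matrices=False)
w = d**2 / (d**2 + lam)
fit_wocr = U @ (w * (U.T @ y))

max_abs_diff = float(np.max(np.abs(fit_direct - fit_wocr)))
```

The two fitted vectors agree up to floating-point error, even though only the $n \times n$ thin SVD is used in the second computation.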
Since RR is most useful for predictive modeling without considering component selection, GCV is an advisable criterion for selecting the best tuning parameter $\hat{\lambda}$. With our WOCR approach, we first plug (10) into GCV to form an objective function of $\lambda$ and then optimize it with respect to $\lambda$. On the basis of (4) and (8), the specific form of GCV($\lambda$) is given, up to some irrelevant constant, by
$$\mathrm{GCV}(\lambda) \propto \frac{\mathrm{SSE}}{(n - \mathrm{EDF})^2} = \frac{y^T (I_n - U W U^T)^2 y}{\left\{ n - \sum_{j=1}^{m} d_j^2/(d_j^2 + \lambda) \right\}^2}, \quad \text{with } W = \mathrm{diag}\left\{ d_j^2/(d_j^2 + \lambda) \right\}.$$
GCV has wide applicability, even in ultra-high dimensions. Alternatively, AIC can be used instead. If $\lim_{n \to \infty} m/n = 0$, GCV is asymptotically equivalent to $\mathrm{AIC}(\lambda) \propto n \ln(\mathrm{SSE}) + 2 \cdot \mathrm{EDF}$. The best tuning parameter in RR can be estimated as $\hat{\lambda} = \mathrm{argmin}_{\lambda}\, \mathrm{GCV}(\lambda)$. Plugging $\hat{\lambda}$ back into $\hat{\beta}_R$ yields the final RR estimator. Since the tuning parameter is determined beforehand, we call this method 'pre-tuning'. We denote this pre-tuned RR method as RR($d$; $\lambda$), where the first argument $d$ indicates the ordering on which the components are sorted and the second argument indicates the tuning parameter $\lambda$. We shall use this as a generic notation for other new WOCR models. As we shall demonstrate with simulation in Section 4.1, RR($d$; $\lambda$) provides nearly identical fitting results to RR; however, pre-tuning dramatically improves the computational efficiency, especially when dealing with massive data.
Remark 1. One statistically awkward issue with regularization is the selection of the tuning parameter. First of all, this is a one-dimensional optimization problem, yet it is handled poorly in current practice by selecting a grid of values and evaluating the objective function at each value. The pre-tuned version helps amend this deficiency. Secondly, although the tuning parameter $\lambda$ is often selected in a data-adaptive way and hence clearly is a statistic, no statistical inference is made for the tuning parameter unless within the Bayesian setting. The above pre-tuning method yields a convenient way of making inference on $\lambda$. Since the objective function GCV($\lambda$) is smooth in $\lambda$, the statistical properties of $\hat{\lambda}$ follow from standard M-estimation arguments. However, this is not the theme of WOCR, so we do not pursue it further.
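To illustrate pre-tuning, the sketch below minimizes GCV($\lambda$) over $\log \lambda$ with a hand-written golden-section search. This search is a simple stand-in chosen to keep the example dependency-free; it is not the optimizer used in the WOCR package, and it only finds a local minimum if the criterion is not unimodal. The function names, the toy data, and the search interval are all assumptions made for illustration.

```python
import numpy as np

def gcv_ridge(lam, d, gamma, yty, n):
    """GCV up to a constant: SSE / (n - EDF)^2 with w_j = d_j^2 / (d_j^2 + lam)."""
    w = d**2 / (d**2 + lam)
    sse = yty - np.sum((2 * w - w**2) * gamma**2)   # y^T (I - H)^2 y
    edf = np.sum(w)
    return sse / (n - edf) ** 2

def golden_section(f, lo, hi, tol=1e-8):
    """Minimize a unimodal f on [lo, hi]; return the approximate minimizer."""
    r = (np.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, e = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if f(c) < f(e):                  # minimum lies in [a, e]
            b, e = e, c
            c = b - r * (b - a)
        else:                            # minimum lies in [c, b]
            a, c = c, e
            e = a + r * (b - a)
    return (a + b) / 2

# Hypothetical toy data
rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ np.ones(p) + rng.standard_normal(n)
U, d, _ = np.linalg.svd(X, full_matrices=False)
gamma, yty = U.T @ y, float(y @ y)

# Search on the log scale so that lambda stays positive
log_lam_hat = golden_section(
    lambda t: gcv_ridge(np.exp(t), d, gamma, yty, n), -10.0, 10.0)
lam_hat = float(np.exp(log_lam_hat))
```

Note that the SVD is computed once; each evaluation of the criterion then costs only $O(m)$ operations, which is where the computational saving over refitting on a grid comes from.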

Pre-Tuned PCR
PCR regresses the response on the first $k$ ($1 \le k \le m$) principal components as given by the SVD of $X$ in (9). The fitted vector in PCR can be rewritten as
$$\hat{y}_{PCR} = \sum_{j=1}^{k} \gamma_j u_j = \sum_{j=1}^{m} \delta_j \gamma_j u_j,$$
where $\gamma_j = \langle y, u_j \rangle$ and $\delta_j = I(j \le k)$ for $j = 1, \ldots, m$. Clearly, PCR can be put in the WOCR form with $w_j = \delta_j$. Conventionally, the ordering of principal components is aligned with the singular values $\{d_j\}$; thus we may rewrite $\delta_j = \delta(d_j; c) = I(d_j \ge c)$ with a threshold value $c = d_k$ if $k$ is known. Either the number of components $k$ or the threshold $c$ is the tuning parameter. Selecting the optimal $k$ by examining many PCR models is a discrete process.
To facilitate pre-tuning, we replace the indicator weight $\delta(x; c) = I(x \ge c)$ with a smooth sigmoid function. While many other choices are available, it is convenient to use the logistic or expit function
$$\pi(x; a, c) = \mathrm{expit}\{a(x - c)\} = \frac{1}{1 + \exp\{-a(x - c)\}}.$$
Figure 1(b) plots $\mathrm{expit}\{a(x - c)\}$ with $c = 50.0$ for different choices of $a$. It can be seen that a larger $a$ value yields a better approximation to the indicator function $I(x \ge c)$, while a smaller $a$ yields a smoother function, which is favorable for optimization. In order to emulate PCR, the parameter $a$ can be fixed a priori at a relatively large value. Our numerical studies show that the performance of the method is quite robust with respect to the choice of $a$. On that basis, we recommend fixing $a$ in the range $[10, 50]$.
Since PCR involves selecting the optimal number of PCs, we use BIC, given by $\mathrm{BIC}(\lambda) \propto n \ln(\mathrm{SSE}) + \ln(n) \cdot \mathrm{DF}$, which is selection-consistent (Yang, 2005) and often has superior empirical performance in variable selection. The hat matrix $H$ in PCR is idempotent, and so is $I_n - H$. Thus the SSE simplifies to $y^T (I_n - H) y = y^T y - \sum_j \delta(d_j; c) \gamma_j^2$, which can then be approximated by substituting $\pi(d_j; a, c)$ for $\delta(d_j; c)$. The DF can be approximated in a similar way as $\mathrm{DF} = k = \sum_j \delta(d_j; c) \approx \sum_j \pi(d_j; a, c)$. This results in the following form for BIC:
$$\mathrm{BIC}(c) \propto n \ln\left\{ y^T y - \sum_{j=1}^{m} \pi(d_j; a, c)\, \gamma_j^2 \right\} + \ln(n) \sum_{j=1}^{m} \pi(d_j; a, c),$$
which is treated as an objective function of $c$. We estimate the best cutoff point $\hat{c}$ by optimizing BIC($c$) with respect to $c$. This is a one-dimensional smooth optimization problem over a search range for $c$. Once $\hat{c}$ is available, we use it as a threshold to select the components and fit a regular PCR. We denote this pre-tuned PCR approach as PCR($d$; $c$). Compared to the discrete selection in PCR, PCR($d$; $c$) is computationally more efficient. Furthermore, it performs better in selecting the correct number of components, especially when weak signals are present. This is an additional benefit of smoothing as opposed to the discrete selection process in PCR, as we will demonstrate with simulation.
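The smoothed BIC can be sketched as follows. The inputs (singular values, coefficients, and $y^T y$) are hypothetical toy numbers in which only the first three components carry signal, and a fine grid stands in for a one-dimensional smooth optimizer to keep the example short; neither is taken from the paper's experiments.

```python
import numpy as np

def expit(z):
    """Logistic function, written via tanh to avoid overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def smooth_bic(c, d, gamma, yty, n, a=50.0):
    """n ln(SSE) + ln(n) DF with the indicator I(d_j >= c) replaced by expit."""
    w = expit(a * (d - c))
    sse = yty - np.sum(w * gamma**2)
    return n * np.log(sse) + np.log(n) * np.sum(w)

# Hypothetical toy inputs: singular values, coefficients, and y^T y
n = 100
d = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
gamma = np.array([10.0, 8.0, 6.0, 0.5, -0.3, 0.2, 0.1, -0.1])
yty = float(np.sum(gamma**2) + 90.0)     # 90 = residual sum of squares outside C(U)

# A fine grid stands in for a one-dimensional smooth optimizer
grid = np.linspace(0.5, 8.5, 161)
c_hat = float(grid[np.argmin([smooth_bic(c, d, gamma, yty, n) for c in grid])])
k_hat = int(np.sum(d >= c_hat))          # components kept by the threshold
```

In this toy setting the estimated cutoff falls between the third and fourth singular values, so thresholding at `c_hat` keeps the three components that carry signal.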

WOCR Variants of RR and PCR Models
Not only can many existing models be cast into the WOCR framework, but the framework also suggests new favorable variants. We explore some of them. A first variant of PCR is to leave both $a$ and $c$ free in (14). More specifically, we first obtain $(\hat{a}, \hat{c}) = \mathrm{argmin}_{a,c}\, \mathrm{BIC}(a, c)$ and then compute the WOCR fitted vector in (1) with weight $w_j = \mathrm{expit}\{\hat{a}(d_j - \hat{c})\}$ for $j = 1, \ldots, m$. This gives PCR more flexibility and adaptivity and hence may lead to improved predictive power. In this approach, selecting components is no longer a concern; thus GCV or AIC can be used as the objective function instead. We denote this approach as PCR($d$; $a$, $c$). The principal components are constructed independently of the response. Artemiou and Ni (2011) argued that the response tends to be more correlated with the leading principal components; this is often not the case in practice, however. See, e.g., Jolliffe (1982) and Hadi and Ling (1998) for real-life data illustrations. Nevertheless, there has not been a principled way to deal with this issue in PCR. WOCR can provide a convenient solution: one simply bases the ordering of $u_j$ on the regression coefficients $\gamma_j$ and defines the weights $w_j$ via a monotone function of $|\gamma_j|$ or, preferably, $\gamma_j^2$. However, doing so makes the weights depend on the response. As a result, the associated DF has to be computed differently, as established in Proposition 3.2.
Proposition 3.2. Suppose that the WOCR model (1) has orthogonal components $u_j$ constructed independently of $y$ and weights $w_j = w(\gamma_j^2; \lambda)$, where $w(\cdot)$ is a smooth monotonically increasing function and $\lambda$ is the parameter vector. Its degrees of freedom (DF) can be estimated as
$$\widehat{\mathrm{DF}} = \sum_{j=1}^{m} \left( 2\gamma_j^2 \dot{w}_j + w_j \right),$$
where $\dot{w}_j = dw(\gamma_j^2; \lambda)/d(\gamma_j^2)$.
Proof. The WOCR model in this case is $\hat{y} = \sum_{j=1}^{m} w_j \gamma_j u_j$, with $\gamma_j = u_j^T y$ and $w_j = w(\gamma_j^2; \lambda)$. It follows by the chain rule that
$$\frac{d\hat{y}}{dy} = \sum_{j=1}^{m} \left( 2\gamma_j^2 \dot{w}_j + w_j \right) u_j u_j^T = U\, \mathrm{diag}\left( 2\gamma_j^2 \dot{w}_j + w_j \right) U^T.$$
Following the definition of DF by Efron (2004), an estimate is given by
$$\widehat{\mathrm{DF}} = \mathrm{tr}\left( \frac{d\hat{y}}{dy} \right) = \mathrm{tr}\left\{ U\, \mathrm{diag}\left( 2\gamma_j^2 \dot{w}_j + w_j \right) U^T \right\} = \sum_{j=1}^{m} \left( 2\gamma_j^2 \dot{w}_j + w_j \right),$$
which completes the proof.
Clearly both PCR and RR can benefit from this reformulation. As a variant of RR, the weight now becomes $w_j = w(\gamma_j^2; \lambda) = \gamma_j^2/(\gamma_j^2 + \lambda)$ and hence $\dot{w}_j = \lambda/(\gamma_j^2 + \lambda)^2$. It follows that the estimated DF is
$$\widehat{\mathrm{DF}} = \sum_{j=1}^{m} \left\{ \frac{2\lambda\gamma_j^2}{(\gamma_j^2 + \lambda)^2} + \frac{\gamma_j^2}{\gamma_j^2 + \lambda} \right\} = \sum_{j=1}^{m} \frac{\gamma_j^2 (\gamma_j^2 + 3\lambda)}{(\gamma_j^2 + \lambda)^2}.$$
The best tuning parameter $\hat{\lambda}$ can be obtained by minimizing GCV. Using similar notation as earlier, we denote this RR variant as RR($\gamma$; $\lambda$). It is worth noting that RR($\gamma$; $\lambda$) is, in fact, not a ridge regression model. Its solution can no longer be nicely motivated by a regularized or constrained least squares optimization problem as in the original RR. But what really matters in these methods is the predictive power. By directly formulating the fitted values $\hat{y}$, the WOCR model (1) facilitates a direct and flexible model specification that focuses on prediction.
Depending on whether or not we want to select components, we may fix $a$ at a large value or leave it free. This results in two PCR variants, which we denote as PCR($\gamma$; $c$) and PCR($\gamma$; $a$, $c$), respectively. Table 1 summarizes the WOCR models that we have discussed so far. Among them, RR($d$; $\lambda$) and PCR($d$; $c$) resemble the conventional RR and PCR, yet with pre-tuning. Depending on the analytic purpose, we also suggest a preferable objective function for each WOCR model. In general, we recommend using GCV for predictive purposes, in which case AIC can be used as an alternative. AIC is equivalent to GCV if $\lim_{n \to \infty} p/n = 0$, both being selection-efficient in the sense prescribed by Shibata (1981). On the other hand, if selecting components is desired, using BIC is recommended.
Remark 2. It is worth noting that the WOCR model PCR($\gamma$; $c$) has a close connection with the MIC (Minimum approximated Information Criterion) sparse estimation method of Su (2015), Su et al. (2016), and Su et al. (2017). MIC yields sparse estimation in the ordinary regression setting by solving a $p$-dimensional smooth optimization problem $\min_{\gamma}\, n \ln \|y - XW\gamma\|^2 + \ln(n)\, \mathrm{tr}(W)$, where $W = \mathrm{diag}(w_j)$ with diagonal element $w_j = \tanh(a\gamma_j^2)$ approximating the indicator function $I(\gamma_j \ne 0)$. Comparatively, PCR($\gamma$; $c$) solves a one-dimensional optimization problem $\min_{c}\, n \ln \|y - UW\gamma\|^2 + \ln(n)\, \mathrm{tr}(W)$. The substantial simplification in PCR($\gamma$; $c$) is due to the orthogonality of the design matrix $U$. Hence the coefficient estimates $\gamma$ in multiple regression are the same as those in simple regression and can be computed ahead of time. Furthermore, the orthogonal regressors $u_j$, i.e., the columns of $U$, are naturally ordered by $\gamma_j^2$. This allows us to formulate a one-parameter smooth approximation to the indicator function $I(\gamma_j^2 \ge c)$, which induces selection of $u_j$ in this PCR variant.
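The DF expression for this RR variant can be sanity-checked against a finite-difference approximation of $\mathrm{tr}(d\hat{y}/dy)$. The sketch below uses hypothetical toy inputs (orthonormal columns obtained from a QR decomposition, a random response, and an arbitrary $\lambda$), and the function names are illustrative only.

```python
import numpy as np

def fit_rr_gamma(U, y, lam):
    """Fitted vector with response-dependent weights gamma_j^2 / (gamma_j^2 + lam)."""
    gamma = U.T @ y
    w = gamma**2 / (gamma**2 + lam)
    return U @ (w * gamma)

def df_rr_gamma(U, y, lam):
    """Closed-form DF estimate: sum_j g2 (g2 + 3 lam) / (g2 + lam)^2, g2 = gamma_j^2."""
    g2 = (U.T @ y) ** 2
    return float(np.sum(g2 * (g2 + 3 * lam) / (g2 + lam) ** 2))

def df_numeric(U, y, lam, h=1e-6):
    """Central finite-difference approximation of tr(d y_hat / d y)."""
    n = len(y)
    tr = 0.0
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        tr += (fit_rr_gamma(U, y + e, lam)[i]
               - fit_rr_gamma(U, y - e, lam)[i]) / (2 * h)
    return tr

rng = np.random.default_rng(3)
n, m, lam = 12, 4, 1.5
U, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal columns, independent of y
y = rng.standard_normal(n)
df_formula = df_rr_gamma(U, y, lam)
df_fd = df_numeric(U, y, lam)
```

The two values agree closely. Note that an individual term $\gamma_j^2(\gamma_j^2 + 3\lambda)/(\gamma_j^2 + \lambda)^2$ can exceed 1, reflecting the extra degrees of freedom spent on letting the weights depend on the response.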

Implementation: R Package WOCR
The proposed WOCR method is implemented in an R package WOCR. The current version is hosted on GitHub at https://github.com/xgsu/WOCR. The main function WOCR() has an argument model= with values in RR.d.lambda, RR.gamma.lambda, PCR.d.c, PCR.gamma.c, PCR.d.a.c, and PCR.gamma.a.c, which correspond to the six WOCR variants listed in Table 1. Among them, RR(d; λ), RR(γ; λ), PCR(d; c), and PCR(γ; c) involve one-dimensional smooth optimization. This can be solved via the Brent (1973) method, which is conveniently available in the R function optim(). Owing to the nonconvex nature of the problem, dividing the search range of the decision variable can be helpful. The other two methods, PCR(d; a, c) and PCR(γ; a, c), involve two-dimensional smooth nonconvex optimization. Mullen (2014) provides a comprehensive comparison of many global optimization algorithms currently available in R (R Core Team, 2018). We have followed her suggestion to choose the generalized simulated annealing method (Tsallis and Stariolo, 1996), which is available from the R package GenSA (Xiang et al., 2013). More details of the implementation can be found in the help file of the WOCR package.

Simulation Studies
This section presents some of the simulation studies that we have conducted to investigate the performance of WOCR models and compare them to other methods.

Comparing Ridge Regression with RR(d; λ)
We first compare the conventional ridge regression with its pre-tuned version, i.e., RR($d$; $\lambda$). The data are generated as follows. We first simulate the design matrix $X \in \mathbb{R}^{n \times p}$ from a multivariate normal distribution $N(0, \Sigma)$ with $\Sigma = (\sigma_{jj'})$ and $\sigma_{jj'} = \rho^{|j - j'|}$ for $j, j' = 1, \ldots, p$. We then apply the SVD to extract the matrices $U$ and $D$, from which the mean response is formed. For each simulated data set, we apply RR (as implemented by the R function lm.ridge) and RR($d$; $\lambda$), both selecting $\lambda$ with minimum GCV. To compare, we consider the mean squared error (MSE) for prediction. To this end, a test data set of size $n' = 500$ is generated in advance. The fitted RR and RR($d$; $\lambda$) from each simulation run are applied to the test set and the MSE is obtained accordingly. The 'best' tuning parameter $\hat{\lambda}$ is also recorded. We only report the results for the setting $\rho = 0.5$, $\sigma^2 = 1$, $p = 100$, $b = (b_j) = (p, p - 1, \ldots, 1)^T/10$. Two sample sizes $n \in \{50, 500\}$ are considered. For each model configuration, a total of 200 simulation runs are considered.
In the simulation, we found that the specification of the search points can be a problem in the current practice of ridge regression. Initially, ridge regression gave inferior performance compared to RR($d$; $\lambda$) in many scenarios. However, after adjusting its search range, the results became nearly identical to those of RR($d$; $\lambda$). This point will be further illustrated in Section 4.3. It is also worth noting that the minimum GCV tends to select a very small $\lambda$ in the ultra-high dimensional case with $p > n$.
The first set (i) fits perfectly to ordinary PCR and hence PCR($d$; $c$) with the number of useful components being 5, while the second set (ii) corresponds to the situation where the response is only associated with the fifth principal component, a scenario that fits best to PCR($\gamma$; $c$). Recall that the shape parameter $a$ in both PCR($d$; $c$) and PCR($\gamma$; $c$) is fixed at a relatively large value. Concerning its choice, we consider four values $a \in \{5, 10, 50, 100\}$. A total of 200 simulation runs are made for each configuration. For each simulated data set, the ordinary PCR is fit with minimum cross-validated error, as implemented in the R package pls, while PCR($d$; $c$) and PCR($\gamma$; $c$) are fit with minimum BIC. Figure 3 plots the number of components selected by each method via boxplots and the MSE for predicting an independent test data set of size $n' = 500$ generated from the same model setting via mean plus/minus standard error bar plots.
It can be seen that PCR substantially overfits in both model settings, resulting in high prediction errors as well. In the first scenario (i), PCR($d$; $c$) and PCR($\gamma$; $c$) both do well with similar performance. In the second scenario (ii), PCR($d$; $c$) fails to identify the correct principal components while PCR($\gamma$; $c$) remains successful by switching the ordering from singular values $d_j$ to squared regression coefficients $\gamma_j^2$. Across the different choices of $a$, the performance of PCR($d$; $c$) and PCR($\gamma$; $c$) is quite stable with some minor variations.
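The design-generation step described above can be sketched as follows. The function name `simulate_design` is hypothetical, the columns are not standardized here, and the construction of the mean response is omitted because its exact form is not given in this passage.

```python
import numpy as np

def simulate_design(n, p, rho, rng):
    """Draw n rows from N(0, Sigma) with Sigma[j, k] = rho^|j - k|."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    X = rng.standard_normal((n, p)) @ L.T      # rows have covariance Sigma
    return X, Sigma

rng = np.random.default_rng(4)
X, Sigma = simulate_design(n=50, p=100, rho=0.5, rng=rng)
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is 50 x 50 when n < p
```

With $n = 50$ and $p = 100$, the thin SVD returns at most 50 components, which illustrates the automatic dimension reduction in the $p > n$ case.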
The covariates of dimension p are independently generated from the uniform[0,1] distribution and the random error term follows N (0, 1). In both models, only the first five predictors are involved in the mean response function. Two choices of p ∈ {5, 50} are considered with n = 500. For each simulated data set, ridge regression, PCR, and six WOCR variants in Table 1 are applied with default or recommended settings. In particular, we fix the scale parameter a = 50 in PCR(d; c) and PCR(γ; c).
To apply ridge regression, we have used $\lambda \in \{0.01, 0.02, \ldots, 200\}$. Table 2 presents the prediction MSE (mean and SE) and the median number of selected components for each method, out of 200 simulation runs. First of all, it can be seen that ridge regression appears to provide the worst results in terms of MSE. This is because of deficiencies in the current practice of ridge regression, which computes ridge estimators for a discrete set of $\lambda$ values within some specific range that may not even include the true global GCV minimum. Comparatively, RR($d$; $\lambda$) provides a computationally efficient and reliable way of finding the 'best' tuning parameter. We could have refit the ridge regression according to the $\hat{\lambda}$ suggested by RR($d$; $\lambda$). Another interesting observation is that RR($\gamma$; $\lambda$) tends to give more favorable results than RR($d$; $\lambda$), because sorting the components according to $|\gamma_j|$ borrows strength from the association with the response.
Among PCR variants, neither PCR($d$; $c$) nor PCR($\gamma$; $c$) performs well. On the basis of BIC, they aim to find a parsimonious true model when the true model is among the candidate models, which, however, is not the case here. In terms of prediction accuracy, it can be seen that RR($\gamma$; $\lambda$), PCR($d$; $a$, $c$), and PCR($\gamma$; $a$, $c$) are highly competitive, all yielding similar performance to PCR. Note that PCR determines the best tuning parameter via 10-fold cross-validation, while PCR($d$; $a$, $c$) and PCR($\gamma$; $a$, $c$) are based on a smooth optimization of GCV and hence are computationally advantageous. In these simulation settings, PCR has selected all components and hence simply amounts to the ordinary least squares fit.

For further illustration, we apply WOCR to two well-known data sets, BostonHousing2 and concrete. The Boston housing data concern predicting the median value of owner-occupied homes for 506 census tracts of Boston from the 1970 census. We used the corrected version BostonHousing2 available from the R package mlbench (Leisch and Dimitriadou, 2012), with $n = 506$ observations and $p = 17$ predictors. The concrete data are available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/). The goal with this data set is to predict the concrete compressive strength based on a few characteristics of the concrete. The data set has $n = 1{,}030$ observations and $p = 8$ continuous predictors.
Figure 4 plots the singular values $d_j$ and the regression coefficients in absolute value $|\gamma_j|$ for both data sets. It can be seen that $d_j$ decreases gradually as expected. The bar plot of $|\gamma_j|$, however, shows different patterns. In the BostonHousing2 data, the very first component is highly correlated with the response, while the others show alternating weak correlations. In the concrete data, the third component is most correlated with the response, followed by the 6th and 5th principal components.
The first two components are only very weakly correlated. This data set is a good example of a case where the top components are not necessarily the most relevant components in terms of association with the response. To compare different models, a unified approach is taken. We randomly partition the data into a training set and a test set with a ratio of approximately 2:1 in sample sizes. The training set is used to construct models, and the constructed models are then applied to the test set for prediction. The default settings in Table 1 are used for each WOCR model, while the default 10-fold CV method is used to select the best model for ridge regression and PCR. We repeat this entire procedure for 200 runs. The prediction MSE and the number of components for every method are recorded for each run. The results are summarized in Table 3. While most methods provide largely similar results, some details are noteworthy. For ridge regression, RR($d$; $\lambda$) slightly outperforms the original ridge regression and is much faster in computation time. Comparatively, RR($\gamma$; $\lambda$) improves the prediction accuracy by basing the weights on the $\gamma_j$'s for the concrete data, where the top components are not the most relevant to the response as shown in Figure 4. Among the PCR models, both PCR($d$; $a$, $c$) and PCR($\gamma$; $a$, $c$) are among the top performers in terms of prediction.
Neither PCR($d$; $c$) nor PCR($\gamma$; $c$) performs as well as the others in terms of prediction accuracy, owing to their different emphasis. Concerning component selection, PCR($\gamma$; $c$) yields simpler models than PCR($d$; $c$) and PCR. This is determined by the nature of each method and data set. Referring to Figure 4, PCR($\gamma$; $c$) clearly helps extract parsimonious models with simpler structures.

Discussion
We have proposed a new way of constructing predictive models based on orthogonal components extracted from the original data. The approach makes good use of the natural monotonicity associated with those orthogonal components. It allows efficient determination of the tuning parameters. The approach results in several interesting alternative models to RR and PCR. These new variants improve either predictive performance or selection of the components. Overall, RR(γ; λ), PCR(d; a, c), and PCR(γ; a, c) are highly competitive in terms of predictive performance.