A Modified PLSR Method in Prediction

Among the many statistical methods for linear models with multicollinearity, partial least squares regression (PLSR) has in recent years become increasingly popular and, very often, the best choice. However, while working on a prediction problem from the automobile market, we noticed that the results from PLSR appear unstable, even though it remains the best among several standard statistical methods. This instability is likely due to information contained in the explanatory variables that is irrelevant to the response variable. Based on the PLSR algorithm, this paper introduces a new method, modified partial least squares regression (MPLSR), to emphasize the impact of the relevant information in the explanatory variables on the response variable. With the MPLSR method, satisfactory prediction results are obtained in the above practical problem. The performance of MPLSR, PLSR and several standard statistical methods is compared in a set of Monte Carlo experiments. This paper shows that MPLSR is the most stable and accurate method, especially when the ratio of the number of observations to the number of explanatory variables is low.


Introduction
In the automobile market, the auction price of a two-year-in-service vehicle is an important indicator of that vehicle's market value, which is of great interest to manufacturers, dealers, financial institutions and consumers. When a linear model is used to predict the auction price, multicollinearity arises. Multicollinearity often exists when the number of explanatory variables is large compared to the number of observations, and it makes parameter estimation difficult.
To address the multicollinearity problem, many statistical methods have been suggested. The variable subset selection method (VSS) avoids the multicollinearity caused by too many variables; its stepwise version is used here. Ridge regression (RR) was suggested by Hoerl and Kennard (1970) as a method for stabilizing regression estimates in the presence of multicollinearity; it assumes that the regression coefficients are not likely to be very large. Principal components regression (PCR), introduced by Massy (1965), tries to reduce the dimension and avoid multicollinearity by using just a few components, linear combinations of the explanatory variables.
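For concreteness, the closed forms behind RR and PCR can be sketched in a few lines. This is only an illustrative sketch on synthetic data, not the computations used later in the paper; centering and scaling are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy collinear design: the third column is nearly a linear
# combination of the first two, so X'X is ill-conditioned.
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([x1, x2, x1 + x2 + 1e-3 * rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Ridge regression: stabilize the estimate via (X'X + lam*I)^{-1} X'y.
lam = 0.5
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Principal components regression: regress y on the leading
# principal components of X, then map back to the X scale.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 2                                    # keep two components
T = U[:, :m] * s[:m]                     # component scores
gamma = np.linalg.solve(T.T @ T, T.T @ y)
beta_pcr = Vt[:m].T @ gamma              # back to original coefficients
```

Both estimators remain well defined even though ordinary least squares is nearly singular here; how many components to keep (for PCR) and how to pick the ridge parameter (for RR) are the tuning questions discussed later.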
Being a comparatively new method, partial least squares regression (PLSR) has become the most popular regression method in chemometrics. PLSR was suggested by Wold (1975), Wold et al. (1984), Martens (1985, 1989), Helland (1988) and Garthwaite (1994). PLSR can be traced to the general systems-analysis methods of Wold. It is a useful tool when multicollinearity exists among the explanatory variables and when the number of explanatory variables is very large compared to the number of observations. PLSR has been studied in great detail. Frank (1993) and Goutis (1996) proved properties of PLSR estimates. Ruscio (2000) studied the relationship between the PLSR algorithm for univariate data and the Cayley-Hamilton polynomial expression. Stone (1990) introduced continuum regression based on OLS, PLSR and PCR. Goutis (1996) introduced a modification of PLSR using a roughness penalty. Wold (1992) and Durand (1997) extended PLSR to nonlinear settings using spline functions. PLSR has now been applied in many fields, especially broadly in chemometrics, "the use of mathematics and statistics on chemical data" (Martens, 1989). PLSR has been compared with other methods in chemistry; see Phatak (1993), Phatak (1997) and Ter Braak (1998). PLSR has also been combined with neural networks in nonlinear analysis (Ham, 1997). Software for PLSR is available in packages such as Unscrambler 7.5 (a PLSR and experimental design package), SAS and SIMCA 8.0 (a PLSR package).
When the four methods (PLSR, RR, PCR and VSS) are used to predict the auction price referred to in the first paragraph, the algorithms of PLSR and RR reach better results than the very large average relative errors obtained with VSS and PCR. Their performance on five different vehicle lines is nevertheless unstable, and therefore unsatisfactory, despite the fact that the five vehicle lines occupy very similar positions in the automobile market.
While studying this practical question, we discovered the reason behind the unstable performance of PLSR and developed the more stable modified partial least squares regression (MPLSR), a modification of PLSR. This paper is organized as follows. Section 2 provides the background of the practical problem, predicting auction prices, and the results from using the four methods (PLSR, RR, PCR and VSS). Section 3 presents the idea of the PLSR method and analyzes its shortcoming, which motivates us to introduce the new MPLSR method. Section 4 introduces the algorithm of the new MPLSR method. Section 5 applies MPLSR to our practical problem in the automobile market and compares its prediction results with those of the other four methods. Section 6 presents a simulation study comparing the performance of MPLSR, PLSR, VSS, RR and PCR.

Automobile Market Prediction Results
In the automobile market, the auction price of a two-year-in-service (2YIS) vehicle is of special interest because it is the basis of many important decisions. For example, it is used to calculate the lease end value. When a consumer leases a vehicle in January 2005 for 2 years, he will return the vehicle in January 2007. The manufacturer suggested price minus the lease end value is his payment for the 2-year lease. In this case, the manufacturer needs to know the auction price of a 2YIS vehicle in January 2007. The auction price of 2YIS vehicles is highly correlated with the quality of the vehicle. A vehicle with good styling and durability will fetch a good price. On the other hand, if a vehicle is a trouble maker, it has less chance of being sold at a good price. The Compact Utility segment is one of the most popular segments in the United States and has attracted a lot of attention recently. We select five major vehicle lines from this segment: Explorer, 4Runner, Grand Cherokee, Cherokee and Blazer. Our goal is to predict the auction prices of their 2YIS vehicles.
The data used in the study include the auction prices and twenty major factors (indexes), including the APEAL score (APEAL stands for Automotive Performance, Execution and Layout), which measures an owner's delight with the design and features of the vehicle, customer satisfaction indexes, durability indexes, money against market (incentive), manufacturer suggested retail price (MSRP), style age and the used-car consumer price index (UCPI). In the auto market, manufacturers modify their vehicles and introduce new model year vehicles each year. For example, in October 1998 the manufacturers introduced 1999 model year vehicles that are modified versions of the 1998 model year vehicles. Most modifications are minor, but some are major and are called major refreshings; in a major refreshing, the exterior and interior styling are changed. The style age equals the current model year minus the model year of the last major refreshing.
The OLS method is used first; only three independent variables are significant under the t-test, with the coefficient of determination (R-square) greater than 0.8, when all twenty independent variables are present. If only the three significant variables are used as independent variables, the R-square is only 0.3. Collinearity is naturally suspected, and among the twenty condition indexes, nine are larger than 60 and one is greater than 1000. Collinearity can also be expected just by considering the original meanings of these independent variables. For example, vehicle lines with higher APEAL and better durability will have a higher MSRP, lower incentive and a higher customer satisfaction index.
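The condition indexes used in this diagnosis are ratios of singular values of the column-scaled explanatory matrix. A small sketch with hypothetical data (not the study's twenty indexes) illustrates how a near-duplicate column drives the largest index up:

```python
import numpy as np

def condition_indexes(X):
    # Scale each column to unit length (a common convention before
    # computing condition indexes), then take ratios of singular values.
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s

rng = np.random.default_rng(1)
a = rng.normal(size=30)
X = np.column_stack([a, a + 1e-4 * rng.normal(size=30), rng.normal(size=30)])
ci = condition_indexes(X)   # near-duplicate columns give a very large index
```

Values above roughly 30 are often read as a sign of harmful collinearity, which is why the counts of indexes above 60 and above 1000 in the auction data are alarming.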
Our study particularly covers two-year-in-service leased vehicles. The auction prices are those from automakers to dealers. A linear model is built on this data set. The auction price of one vehicle line from January 1995 to June 1999 is the response variable, and all twenty other major variables of that vehicle line from the corresponding two-year-earlier periods, January 1993 to June 1997, are the explanatory variables. Here, the auction price in January 1995 and the values of the other variables in January 1993 refer to the same batch of vehicles, because the auction price of a new vehicle produced in January 1993 becomes available only after two years. This linear model captures the relationship between a vehicle's attributes and its auction price two years later.
Regression ARIMA was the first model we tried in this study. However, because this is a long-term (24-month) forecast and multicollinearity complicates the problem, the time series method offers no advantage here.
The methods of VSS (stepwise), RR, PCR and PLSR are natural candidates. First, all four methods are applied to the five compact utility vehicle lines, and the prediction results are analyzed.
The monthly averages of auction prices from July 1999 to December 2000 (18 months), not used in the regression, are used to verify the predictions. For analyzing the prediction results, we calculate the errors between the predicted and the actual auction prices. The average of the relative prediction errors, ARE = (1/18) Σ_{t=1}^{18} |y_t − ŷ_t| / ȳ, is used as a criterion to test the predicting capability of a model. Here, ȳ is the mean of the auction prices from January 1993 to June 1999, y_t is the actual auction price and ŷ_t is the predicted auction price. The relative errors (y_t − ŷ_t)/ȳ, t = 1, ..., 18, from the four methods are plotted together for comparison. Because the ARE is a commonly used index in practice, it is used here, while a similar measure, the average prediction squared error (PSE), is used in Section 6. Table 1 shows the average relative prediction errors for the five SUV vehicle lines using all four methods. The results show that PLSR is the best of the four methods. Here, one-at-a-time cross-validation is used to select the cut-off place of the PCR. The independent matrix is standardized in RR. For detailed information about PLSR, please see the Appendix. Since one of the important assumptions of RR is that the regression coefficients are not likely to be very large, the ridge parameter is usually selected from the range [0, 1] in practical problems. In Table 1, the ridge parameters are the optimal ones in [0, 1], i.e., those providing the lowest ARE. From Table 1, the other three methods produce larger prediction errors than PLSR on average.
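As a sketch, the ARE criterion can be computed as follows; the prices below are hypothetical, and the formula follows the relative errors (y_t − ŷ_t)/ȳ defined above, with absolute values averaged over the holdout months.

```python
import numpy as np

def average_relative_error(y_actual, y_pred, y_bar):
    """ARE: mean of |y_t - yhat_t| / y_bar over the holdout months,
    where y_bar is the in-sample mean auction price."""
    return np.mean(np.abs(y_actual - y_pred) / y_bar)

y_actual = np.array([18.2, 17.9, 17.5])   # hypothetical prices ($1000s)
y_pred   = np.array([18.0, 18.3, 17.1])   # hypothetical predictions
are = average_relative_error(y_actual, y_pred, y_bar=18.0)
```

Scaling by the in-sample mean ȳ, rather than by each month's actual price, keeps the criterion stable when individual monthly prices are small or noisy.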
However, the performance of PLSR is inconsistent among the five vehicle lines. It obtains satisfactory prediction results for the first three vehicle lines but not the last one. When the PLSR method is used, the predicted auction prices of Blazer have a large bias from the actual values. It is this case that draws our attention: discovering why PLSR becomes inefficient here may be the key to overcoming this shortcoming of the PLSR method. Figure 1 shows the relative errors in predicting the auction prices of the five SUV vehicle lines. From Figure 1, the predicted values for Blazer from all four methods are much higher than the actual auction values. This inefficiency of all the methods may be caused by irrelevant-to-the-response information contained in the explanatory variables during the prediction period.

The Situation Where the PLSR Method Does Not Work Well
As we saw in the last section, the PLSR method does not work well in all situations. It provides a very inaccurate prediction of the auction price for Blazer, although the prediction results for the other four vehicle lines are reasonable. This phenomenon is caused by information in the explanatory matrix that is irrelevant to the response variable.
Let the linear model (here only a univariate response is considered) be Y = Xβ + ε, where Y is an n × 1 response vector, X a known n × k explanatory matrix, and ε an n × 1 noise vector. The matrix X of explanatory variables contains two types of information. One type is relevant to the response variable Y and therefore useful in predicting the value of Y. The other type is irrelevant to Y and hence causes inefficiency in the prediction. The idea of the PLSR algorithm is to extract components (factors) {t_i} from X that are relevant to Y. These components are extracted in decreasing order of relevance measured by the covariance Cov(t_i, Y). Let T be the matrix of the selected components t_i, so that T = XW, where the columns of W are weight vectors for the columns of X. Ordinary least squares regression of Y on the matrix T then produces the coefficient vector V, with Ŷ_PLS = TV, and the estimator β̂_PLS of the original β has the form β̂_PLS = WV. A detailed version of the PLSR algorithm is provided in the Appendix.
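The extraction of components can be made concrete with a standard NIPALS-style sketch of univariate PLSR. The paper's own algorithm appears in its Appendix; this is one common variant (with centering, without scaling), so details may differ from the authors' implementation.

```python
import numpy as np

def pls1(X, y, q):
    """Univariate PLS regression, NIPALS-style sketch with q components."""
    X = X - X.mean(axis=0)            # center, as is conventional
    y = y - y.mean()
    Xk = X.copy()
    W, P, r = [], [], []
    for _ in range(q):
        w = Xk.T @ y                  # weight maximizing |Cov(t, y)|
        w /= np.linalg.norm(w)
        t = Xk @ w                    # component (factor) t_i
        p = Xk.T @ t / (t @ t)        # X loading
        r.append((y @ t) / (t @ t))   # OLS of y on t
        Xk = Xk - np.outer(t, p)      # deflate X
        W.append(w); P.append(p)
    W, P = np.array(W).T, np.array(P).T
    # beta_PLS = W (P'W)^{-1} r, so that yhat = X beta_PLS
    return W @ np.linalg.solve(P.T @ W, np.array(r))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.01 * rng.normal(size=50)
beta = pls1(X, y, q=5)   # with q = k components, PLS1 coincides with OLS
```

With fewer components than variables (q < k), the same routine gives the shrunken, collinearity-tolerant fit that makes PLSR attractive.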
In PLSR, despite different approaches, each factor t_i is selected to maximize, in absolute value, the covariance Cov(t_i, Y), where Cov(t_i, Y) = Corr(t_i, Y)·[Var(t_i) Var(Y)]^{1/2}. Since t_i is a linear combination of the independent variables, its variance may not be 1, even though the variances of all independent variables are standardized to 1. An ideal situation is that both Var(t_i) and Corr(t_i, Y) decrease monotonically as Cov(t_i, Y) decreases during the process of selecting components; that would mean the most representative (large variance) and the most relevant (large correlation) elements of X are used for the regression. Unfortunately, a large Cov(t_i, Y) does not guarantee that Corr(t_i, Y) and Var(t_i) are both large at the same time.
As a simple illustration, let Y = 3z_1 and X = (z_1 + 20z_2 + 65z_3, z_2, z_3, 10z_2 + 9z_3), where z_i denotes the 4 × 1 vector whose ith element is one and whose other elements are zero. We want to regress Y on X. The solution of the regression is obvious: in terms of the relation between X and Y, Y = 3z_1. The ordinary least squares (OLS) method does not work here because of the multicollinearity. What would the PLS method say about this example? We use PLS to find the factors t_1, t_2 and t_3. Clearly the chosen t_1, which is composed mainly of z_2 and z_3, has little relation with Y or z_1. On the contrary, the last factor t_3 has no chance of being selected by cross-validation, even under the least conservative criterion, because of its small covariance, although it is more correlated with Y than the first two. Therefore PLSR does not work in this situation. To emphasize the information relevant to Y in the modeling process, and thereby reach better predictions, we next introduce the modified partial least squares regression (MPLSR) algorithm.
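The point of the example can be verified numerically. Below, t_1 is built from the first PLS weight direction (proportional to X'Y; centering and scaling are omitted here, a simplification of the full algorithm), and t* = z_1 is the factor that is perfectly correlated with Y but has small variance, hence small covariance:

```python
import numpy as np

# The example's design: z_i are the standard basis vectors of R^4.
z1, z2, z3 = np.eye(4)[0], np.eye(4)[1], np.eye(4)[2]
X = np.column_stack([z1 + 20 * z2 + 65 * z3, z2, z3, 10 * z2 + 9 * z3])
Y = 3 * z1

# First PLS factor direction: t1 = X w1 with w1 proportional to X'Y.
t1 = X @ (X.T @ Y)                                # proportional to the first column

# A factor that recovers z1 exactly (so Corr = 1 with Y) but has
# small variance, hence a small covariance with Y.
t_star = X @ np.array([1.0, -20.0, -65.0, 0.0])   # equals z1

cov1, cov_star = np.cov(t1, Y)[0, 1], np.cov(t_star, Y)[0, 1]
cor1, cor_star = np.corrcoef(t1, Y)[0, 1], np.corrcoef(t_star, Y)[0, 1]
```

The covariance ranking and the correlation ranking disagree: t_1 wins on covariance purely through its large variance, while the fully relevant direction t* wins on correlation, which is exactly the failure mode described above.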

Modified Partial Least Squares Regression (MPLSR) Algorithm
The main idea of our MPLSR method is to use an orthogonal projection that removes from Ŷ_PLS the elements irrelevant to Y. First, we find factors that are linear combinations of the independent variables and orthogonal to Y. Second, the effect of the irrelevant information in X is removed by projecting Ŷ_PLS onto the orthogonal complement of the space spanned by those factors. The following is the algebra of the MPLSR algorithm.
For our model Y = Xβ + ε, note that X'YY'X is a real symmetric k × k matrix of rank 1, so it has k − 1 orthogonal eigenvectors b_1, ..., b_{k−1} corresponding to the zero eigenvalue; let B ≡ (b_1, ..., b_{k−1}), a k × (k − 1) matrix. Every linear combination XBα of the columns of XB is then orthogonal to Y. Among unit vectors α (α'α = 1), we pick those that maximize the variance of XBα, which are the eigenvectors corresponding to the largest eigenvalues of B'X'XB. So we select a number of maximal eigenvalues of B'X'XB, λ_1, ..., λ_s, such that the cumulative eigenvalue proportion Σ_{i=1}^{s} λ_i / Σ_{i=1}^{k−1} λ_i is greater than a certain value, say 99%. Let their corresponding eigenvectors be the columns of the matrix A ≡ (α_1, ..., α_s). Also let U = XBA, which is orthogonal to Y. The projection of X onto the orthogonal complement of U is (I_n − U(U'U)^{-1}U')X, which we write as XD. Let the original PLSR fitted vector be Ŷ_PLS and the estimated coefficient vector be β̂_PLS. Then the fitted value from our MPLSR algorithm is defined by Ŷ_MPLS ≡ XD β̂_PLS, and the estimated coefficients by MPLSR are β̂_MPLS = D β̂_PLS.
The estimate Ŷ_MPLS removes from Ŷ_PLS the information irrelevant to Y and emphasizes the role of the relevant information during the estimation process.
Since MPLSR is based on the result of the PLSR method, a better result will be obtained when the PLSR prediction of Y includes more relevant information.
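The projection step just described can be sketched as follows, assuming a PLSR coefficient vector beta_pls is already available. The function name, the SVD-based construction of B, and the use of a pseudoinverse are our choices for the sketch, not the paper's implementation.

```python
import numpy as np

def mplsr_fit(X, Y, beta_pls, cum=0.99):
    """Sketch of the MPLSR projection step, given a PLSR coefficient vector."""
    # B: a k x (k-1) basis orthogonal to X'Y, so every XBa satisfies Y'XBa = 0.
    v = (X.T @ Y).reshape(1, -1)
    _, _, Vt = np.linalg.svd(v)           # rows 1..k-1 of Vt span the null space
    B = Vt[1:].T
    # Among unit vectors a, Var(XBa) is maximized by the top eigenvectors
    # of B'X'XB; keep enough of them for a 99% cumulative eigenvalue share.
    lam, E = np.linalg.eigh(B.T @ X.T @ X @ B)
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]
    s = int(np.searchsorted(np.cumsum(lam) / lam.sum(), cum)) + 1
    U = X @ B @ E[:, :s]                  # U = XBA, orthogonal to Y
    # Project the PLSR fit onto the orthogonal complement of U.
    P = np.eye(len(Y)) - U @ np.linalg.pinv(U)
    return P @ (X @ beta_pls)             # yhat_MPLS = XD beta_PLS

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 8))
beta_true = rng.normal(size=8)
Y = X @ beta_true                         # noiseless case: X beta_true = Y
yhat_mpls = mplsr_fit(X, Y, beta_true)
```

Because U is orthogonal to Y by construction, the projection leaves any fit that already matches Y untouched; it only strips the directions in the fitted values that carry no information about Y.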

A Comparison of the MPLSR with Four Methods for the Auction Price Case
In this section, we continue the discussion of the prediction problem of Section 2. The proposed MPLSR is used to predict the residual values of two-year-in-service vehicles, and the results are compared to those from the PLSR, VSS, RR and PCR methods. Table 2 presents the average relative errors in predicting the five SUV vehicle lines using all five methods (the results other than MPLSR were shown in Table 1). From Table 2, the other four methods produce larger prediction errors than MPLSR on average. Compared to PLSR, the MPLSR method produces smaller errors, except for 4Runner and Cherokee, where the two are very close; both methods produce similar error patterns that closely track each other over time (see Figure 2).
Unlike PLSR, the performance of MPLSR is consistent when it is used to predict the auction prices of the five similar vehicle lines. When the MPLSR method is used, the average relative errors of the five vehicle lines are almost all under 5%, and MPLSR's result for Blazer is much better than those from the other four methods.
The MPLSR method performs consistently not only in predicting the auction prices of different vehicle lines but also in predicting one vehicle line's auction prices at different times. Table 3 provides the standard deviations of the prediction errors, which measure the dispersion of the prediction errors for each vehicle line. As noted in Table 3, in most situations the standard deviation of the errors from MPLSR is less than that from the other four methods for the test data. This demonstrates the better predictive capability and more stable results of MPLSR compared to the other methods.
Figure 2 presents the relative errors in predicting the auction prices of the five SUV vehicle lines. In each panel, the relative errors from the five methods are plotted together for comparison. As shown in Figure 2, the errors in predicting the auction prices of Explorer by the PLSR method are particularly large in the last two months. This unexpectedly large error is partially caused by noise present in the original data. The error is, however, significantly smaller when the MPLSR method is applied: MPLSR removes irrelevant information and reduces the disturbance caused by noise, so the prediction errors from MPLSR fluctuate less. The predictions of 4Runner's auction price using PLSR and MPLSR are very close in the year 2000. The average error of PLSR is smaller than that of MPLSR, and both methods show similar patterns.

It is clear that MPLSR produces significantly better predictions for Blazer than the other methods. The predicted values of PLSR, VSS, PCR and RR are much higher than the real auction values. All four methods are influenced by the same kind of irrelevant information, and this information deviates strongly from its usual pattern in the last half year. Without removing the irrelevant information, PLSR produces results with a large bias; by removing it, MPLSR tracks the trends very well, although its pattern of prediction errors is similar to that of PLSR.
For each of the five vehicle lines, the errors of PLSR and MPLSR follow the same trend over time, although their magnitudes differ. Because MPLSR emphasizes the information relevant to Y, its predictions often follow the real values more closely than those of the PLSR method.
The five methods were also used to predict the auction prices of five upper middle vehicle lines. The results are similar: the MPLSR method provides the most accurate and stable auction price predictions among the five methods.
This practical example demonstrates that the MPLSR algorithm does have advantages over the other four methods when multicollinearity exists. For further investigation, we next use Monte Carlo analysis to compare the MPLSR method with the others.

A Simulation Comparison of MPLSR, PLSR, VSS, RR and PCR
To understand in what situations MPLSR can be expected to work well compared to the other standard methods, VSS (stepwise), RR, PLSR and PCR, a set of Monte Carlo experiments was performed; a summary of the results is presented in this section.
The five methods are compared in 360 different situations with different numbers of explanatory variables (k = 30, 60 and 100) and different levels of collinearity in the explanatory matrix, meaning that the correlation matrix of the explanatory variables has different structures (low collinearity: all off-diagonal elements 0.4; middle collinearity: all off-diagonal elements 0.7; high collinearity: all off-diagonal elements 0.9). The situations also have different noise-to-signal ratios (σ/[Var(Xβ)]^{1/2} = 0.05 or 0.1) and different true regression coefficients (20 sets of regression coefficients β generated randomly from the normal distribution N(0, 100)). So there are in total 3 × 3 × 2 × 20 = 360 situations. For each situation, 100 data sets are generated and the results are reported as means over the 100 replications. Each data set includes 150 observations. The first 50 observations are training data used to estimate the regression coefficients by the five regression methods (MPLSR, PLSR, VSS, RR and PCR). The last 100 observations are test data used to evaluate the performance of the five methods.
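One replication of this design might be generated as follows. This is a sketch in the spirit of the description above; the function name and the exact way the noise is scaled to the target noise-to-signal ratio are our choices.

```python
import numpy as np

def make_dataset(n=150, k=30, rho=0.7, nsr=0.1, rng=None):
    """One simulated dataset: equicorrelated predictors (off-diagonal rho),
    beta ~ N(0, 100), and noise scaled to a target noise-to-signal ratio."""
    rng = rng or np.random.default_rng()
    Sigma = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    L = np.linalg.cholesky(Sigma)
    X = rng.normal(size=(n, k)) @ L.T          # rows ~ N(0, Sigma)
    beta = rng.normal(scale=10.0, size=k)      # N(0, 100)
    signal = X @ beta
    sigma = nsr * signal.std()                 # sigma / sd(X beta) = nsr
    y = signal + rng.normal(scale=sigma, size=n)
    return X, y, beta

X, y, beta = make_dataset(rng=np.random.default_rng(3))
X_train, y_train = X[:50], y[:50]              # first 50: training data
X_test, y_test = X[50:], y[50:]                # last 100: test data
```

Varying k, rho and nsr over the grids listed above, and redrawing beta 20 times, reproduces the 360-situation layout of the experiment.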
The average prediction squared error, PSE = (1/100) Σ_t (y_t − ŷ_t)², computed over the 100 test observations, is used as the statistic to compare the performance of the five regression methods. The PSE values for each method in each situation are averaged over the 100 replications to compare predictive capability, and the SDP values (the standard deviation of PSE over the 100 replications) are calculated to measure the stability of prediction. Lower values of PSE and SDP indicate better performance. Frank and Friedman (1993) provided the optimal ridge parameter λ that minimizes the mean squared error (MSE) of prediction, so the optimal ridge parameter can be calculated in each situation and used as the basis for RR. The results of the other four regression methods (PLSR, VSS, RR and PCR) were obtained by standard SAS procedures (PROC PLS and PROC REG).
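The two summary statistics can be sketched in a few lines; the pse_values shown are hypothetical, for illustration only.

```python
import numpy as np

def pse(y_test, y_pred):
    """Average prediction squared error over the test observations."""
    return np.mean((y_test - y_pred) ** 2)

# SDP: the standard deviation of the replication-level PSE values.
pse_values = np.array([1.2, 0.9, 1.1, 1.0])    # hypothetical replications
sdp = np.std(pse_values, ddof=1)
```

PSE measures accuracy on a single replication; SDP summarizes how much that accuracy varies across replications, which is the stability criterion used below.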
The results of the Monte Carlo experiments are presented in Figures 3-6. Figure 3 shows the average PSE and SDP of the five methods over the 360 situations. It demonstrates that our new method MPLSR has the best performance, with the smallest average PSE and SDP (that is, the best accuracy and stability), while VSS (stepwise) is the worst. From Table 4, one can see that, having the smallest PSE among the five methods, MPLSR reduces PSE by 6.8 percent compared to the PSE of PLSR, which has the second smallest PSE. This means that MPLSR improves predictive capability significantly. Since the SDP of MPLSR is the lowest among the five methods, MPLSR is the most stable of the five. When MPLSR is used, the SDP is 16 percent lower than that of the PLSR method. The advantage of MPLSR in stability of prediction is significant compared with the other four methods.
From Figure 3 and Table 4, the new method MPLSR provides significantly the best average overall performance; PLSR and PCR follow closely, and RR gives an inferior overall performance only slightly better than VSS. Since the performance of these methods may change with the situation, discussing their performance in different situations is necessary.
Figures 4-6 present a detailed graphical summary of the simulation results classified according to three data characteristics (number of independent variables, collinearity level and noise-to-signal ratio). Figure 4 demonstrates that MPLSR provides the best results at both levels of the noise-to-signal ratio; RR behaves differently from the other methods. Figures 4-6 show that MPLSR provides the best and most stable predictions (the lowest PSE and SDP) among the five methods in almost all situations; the one exception (see Figure 3, when k = 30) is where RR performs better.
However, in our automobile market example, the predictions of RR are no better than those of MPLSR, although the observation-to-variable ratio (OVR) is higher than 50/30. One reason is the difficulty of determining the ridge parameter in practice: the optimal ridge parameter cannot be obtained in a real problem. Because the RR method is sensitive to the ridge parameter, a bad ridge parameter produces a model that cannot yield a reasonable prediction. From this point of view, MPLSR is a more practicable method than RR.
We should note that the patterns of the five methods are similar in both the practical example and the simulations. This supports the advantage of the MPLSR method in different situations.

Discussion
In this paper, the MPLSR method has been introduced for the case in which the explanatory matrix X includes much information irrelevant to the response variable Y. It is an algebraic algorithm based on the result of the PLSR method. Both the Monte Carlo experiments and the practical example demonstrate that the new method produces more accurate and stable results than the other standard statistical methods (VSS, RR, PCR and PLSR), especially when the observations-to-variables ratio is low and the multicollinearity among the independent variables is high.
We suggest that, even in the component-selection steps of PLSR, one should select not only the components with large covariance with the dependent variable Y but also the components with large correlation with Y. One possible way is to apply PLSR between Y and XD instead of between Y and X; another is to compromise between Var(t_i) and Corr(t_i, Y) in the criterion used for selecting components in the PLSR process. Let w*_i = [Π_{j=1}^{i−1} (I_p − w_j p_j')] w_i and substitute Xw*_i for t_i in (3); the PLS estimate of Y is then Ŷ_PLS = X Σ_{i=1}^{q} r̂_i w*_i. (4) For simplicity, let β̂_PLS = Σ_{i=1}^{q} r̂_i w*_i, so that Ŷ_PLS = X β̂_PLS.

Figure 1: Relative errors in predicting auction prices.

Since X'YY'X is a real symmetric matrix with rank 1, it has k − 1 orthogonal eigenvectors corresponding to the zero eigenvalue. Let b_1, ..., b_{k−1} denote these eigenvectors and B ≡ (b_1, ..., b_{k−1}), a k × (k − 1) matrix.

Figure 2: Relative errors in predicting auction prices (with MPLSR).

Figure 4: Performance comparison of the five methods on PSE (4-a) and SDP (4-b) for two levels of noise-to-signal ratio.

Table 1: Average relative errors of the four methods.

It is possible that a factor t_i with a large Cov(t_i, Y), caused by a large Var(t_i) despite a relatively small Corr(t_i, Y), may be selected, while another factor t*_i with a slightly smaller Cov(t*_i, Y), caused by a relatively smaller Var(t*_i) despite a larger Corr(t*_i, Y), may be discarded. As a consequence of discarding information relevant to Y, Ŷ_PLS has a lower correlation with Y. Let us see a simple illustrative example, where no error term is added, for clarity. The factors t_1, t_2 and t_3 are ordered according to their values of Cov(t_i, Y) in descending order. The values of Cov(t_i, Y), Var(t_i) and Corr(t_i, Y) for i = 1, 2, 3 reveal the problem: with the common criterion in cross-validation, the PLS method selects only t_1 as the regressor because it has the largest covariance Cov(t_1, Y), which, however, is almost entirely due to the largest Var(t_1), despite its small Corr(t_1, Y). The reason for these values of covariance, variance and correlation is the composition of the t_i: in matrix notation, each t_i is a linear combination of z_1, z_2 and z_3.

Table 2: Average relative errors of the five methods.

Table 3: Standard deviation of the prediction errors of the five methods.

Table 4: Percentages by which the MPLSR method reduces the PSE and SDP values relative to the four other methods. In the table, PPSE and PSDP are the reduced percentages based on all situations.