Boiling Point Prediction Study via Dimension Reduction Methods: SIR, PCR and PLSR

Variable selection is an important tool in QSAR. In this article, we employ three known techniques, sliced inverse regression (SIR), principal components regression (PCR) and partial least squares regression (PLSR), to build models that predict the boiling points of 530 saturated hydrocarbons. With 122 topological indices as input variables, our results show that these three methods have good performance and perform better than some existing methods in the literature.


Introduction
The goal of quantitative structure-activity relationship (QSAR) research is to establish a relationship between certain molecular activities and molecular descriptors by means of statistical tools (Devillers and Balaban 1999): A(activity) = f(molecular structure) = f(molecular descriptors).
For example, one can predict molecular physicochemical, biological and toxicological properties from a statistical model based on chemical characteristics. One of the most frequently used chemical characteristics is the so-called topological index (TI). A TI is a numerical value intended to characterize the chemical structure of the underlying chemical compound; it quantifies a link between the molecular activity and the molecular structure of that compound. The concept of the TI is so attractive that the literature is flooded with various TI's: since the Wiener index, the first TI, was proposed in 1947, more than four hundred TI's have been defined by researchers. Each TI is proposed with a mind to capture the most important characteristics of the molecular activity and structure of a compound. But molecular activity and structure are complex matters, and no single TI can completely determine molecular activity for a given class of chemical compounds. A natural approach, in fact a popular direction taken by many authors, is to consider all available TI's and choose a subset of them from which a certain molecular activity can be predicted.
Along this line of approach there are many difficulties, such as: (a) there is strong collinearity among the TI's; (b) many useful model selection techniques, such as backward elimination and best subset selection, cannot be implemented because the number of possible models, 2^p, grows exponentially fast. It is clear that, with this approach, it is difficult to reach an ideal model. As a result, many new TI's are proposed every year, which only aggravates difficulties (a) and (b). Thus, selecting efficient variables becomes very important in QSAR studies. How can we select efficient variables and use them to construct an ideal model? In this paper, we take the view that the performance of a learning method is measured by its prediction capability on independent test data: the better the prediction on new test data, the better the model. We will use cross-validation for performance assessment throughout this paper. But first let us discuss the interplay between bias, variance and model complexity.

Model selection and bias-variance tradeoff
Assume we have p input variables for each of the n responses. For example, in this paper, each variable is a certain TI and each response is the boiling point of a compound under consideration. The output of the n samples is summarized as an n-vector y = (y_1, y_2, ..., y_n)'. The i-th observation is written in lowercase as x_i = (x_{i1}, x_{i2}, ..., x_{ip}), corresponding to the p TI's of that compound. In this paper, matrices are denoted by bold uppercase letters; for example, the set of n input p-vectors x_i, i = 1, 2, ..., n, is written as the n × p matrix X = (x_{ij}), and the j-th column of X is denoted by x^{(j)}. Given an input matrix X, the prediction of the output y is denoted by ŷ. We will use the design matrix X and the response data y to find new variables for better prediction of the boiling points of the compounds in the testing samples. These new variables are also called new features, and the methods we employ are basically dimension reduction techniques. All new features derived by the dimension reduction methods will be denoted by z_m, m = 1, 2, ..., M. The meaning of this notation remains unchanged throughout the paper unless specified otherwise.
Assume that the true model is y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ². The true f is unknown, and we want to find an approximate model f̂(X) to replace it. The expected squared prediction error at X = x_0 under f̂ is

Err(x_0) = E[(y − f̂(x_0))²] = σ² + [f(x_0) − E f̂(x_0)]² + E[f̂(x_0) − E f̂(x_0)]²,   (1)

that is, irreducible noise plus squared bias plus variance. For a linear approximate model f̂(x) = x'β̂, where β̂ is the least squares estimator of β, the variance term is Var(f̂(x_0)) = ||h(x_0)||² σ², where h(x_0) = X(X'X)^{-1} x_0. While this variance changes with x_0, its average over n_1 testing sample values is (p/n_1)σ². Minimizing (1) is the goal of constructing a satisfactory model. Let us study the three parts on the right-hand side of equation (1): (1) The first term, σ², cannot be avoided no matter how well f(x_0) is estimated. (2) For the second term, let β* denote the parameters of the best-fitting linear approximation to f, i.e. β* = arg min_β E[f(X) − X'β]²; then the average squared bias decomposes as

Ave[f(x_0) − E f̂(x_0)]² = Ave[f(x_0) − x_0'β*]² + Ave[x_0'β* − E x_0'β̂]² = Ave(Model Bias)² + Ave(Estimation Bias)².

If β̂ is the ordinary least squares estimator, the "Estimation Bias" is zero, while the "Model Bias" can only be reduced by enlarging the dimension p of the input variable space.
(3) The third term, the average variance (p/n_1)σ², obviously increases as p increases. So there is a bias-variance tradeoff. Figure 1 shows the typical behavior of the test and training errors as model complexity is varied.
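As a quick illustration of the tradeoff above, the following sketch fits ordinary least squares with an increasing number of predictors and records training and test errors. The simulated data (a linear truth in 5 of 40 candidate predictors) is hypothetical and stands in for the chemical data set; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: linear truth using only the first 5 of 40 predictors
n, p_max, sigma = 100, 40, 1.0
X_tr = rng.standard_normal((n, p_max))
X_te = rng.standard_normal((1000, p_max))
beta_true = np.zeros(p_max)
beta_true[:5] = [3, -2, 1.5, 1, -0.5]
y_tr = X_tr @ beta_true + sigma * rng.standard_normal(n)
y_te = X_te @ beta_true + sigma * rng.standard_normal(1000)

train_err, test_err = [], []
for p in range(1, p_max + 1):
    # Ordinary least squares on the first p predictors
    b, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    train_err.append(np.mean((y_tr - X_tr[:, :p] @ b) ** 2))
    test_err.append(np.mean((y_te - X_te[:, :p] @ b) ** 2))
```

The training error never increases as p grows (the models are nested), while the test error bottoms out near the true complexity and then rises with the variance term, reproducing the shape of Figure 1.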

Cross-validation
In this subsection we describe the cross-validation method that we shall use to find a suitable set of variables for our eventual model, namely the one that minimizes the prediction error on testing samples.
Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. Since data are often scarce, this is usually not possible. To finesse the problem, K-fold cross-validation uses part of the available data to fit the model and a different part to test it. We split the whole data set into K roughly equal-sized parts; each part is held out in turn, the model is fit on the remaining K − 1 parts, and the prediction error is computed on the held-out part. Typical choices of K are 5 or 10 when the number of observations is very large. The case K = N is known as leave-one-out cross-validation. Our data set has 530 observations and we employ leave-one-out cross-validation to assess the ability of each method. At the same time, the function CV(β) provides an estimate of the test error curve, and we find the tuning parameter β that minimizes CV(β). The final model is then fit to all the data.
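A minimal sketch of leave-one-out cross-validation for an ordinary least squares model; the synthetic data here is a stand-in for the 530-compound set, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n observations, p predictors, linear truth plus noise
n, p = 60, 4
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(n)

def loo_rmse(X, y):
    """Leave-one-out cross-validation RMSE for ordinary least squares."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out observation i
        b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append(y[i] - X[i] @ b)           # prediction error on it
    return float(np.sqrt(np.mean(np.square(errs))))

rmse = loo_rmse(X, y)
```

In practice one computes such a curve for each candidate model size M and picks the M with the smallest cross-validated RMSE, exactly as done for SIR, PCR and PLSR below.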
In this paper three function approximation methods, namely sliced inverse regression (SIR), principal components regression (PCR) and partial least squares regression (PLSR), are compared. These methods are evaluated under the framework of chemometrics, especially in QSAR research. We further analyze the similarities, differences and efficiency of these methods from their basic reasoning and concepts. Finally, we compare our results with those of previous work and conclude that SIR, PCR and PLSR are promising methods for reducing the dimension of the variable space.
This paper is organized as follows. In Section 2, we briefly introduce the SIR, PCR and PLSR methods. In Section 3 these three methods are applied to a real set of chemical data, and cross-validation is employed to compare the models recommended by the three methods.

Dimension Reduction
In this section, three methods, sliced inverse regression (SIR), principal components regression (PCR) and partial least squares regression (PLSR), are briefly introduced.

Sliced inverse regression (SIR)
The data set for SIR has the same form as for a typical linear regression: there are n observations on p variables, a typical observation being (y_i, x_{i1}, x_{i2}, ..., x_{ip}). We can, of course, rewrite it in the familiar matrix notation: the response vector y is n × 1 and the design matrix X is n × p. The distribution of x is assumed elliptically symmetric (e.g., a normal distribution). Li (1991) suggests approximating the true model with one of the structure

y = g(β_1'x, β_2'x, ..., β_m'x, ε),

where g is an unknown function. In SIR, β_1, β_2, ..., β_m are m p-dimensional column vectors, called projection directions. A salient feature of SIR is the concept of the effective dimension reduction (e.d.r.) space B, which is spanned by the m vectors β_1, β_2, ..., β_m. Any non-zero vector in the e.d.r. space is called an e.d.r. direction, and in a typical SIR application it is hoped that m is much smaller than p. If this is the case, then we only need to find estimates of these m p-dimensional vectors. In exploring the non-linear relationship between y and x, SIR considers the inverse regression that treats y as if it were the independent variable and x as if it were the dependent variable. In fact, it is the curve η(y) = E(x|y), treated as a function of y, that is to be explored. We denote the covariance matrix of the inverse regression curve by Σ_η = Cov(E(x|y)). Although the point E(x|y = y), being p-dimensional by definition, stays in p-dimensional space, it can be shown that (after centering) it is contained in the linear subspace spanned by Σ_x β_1, Σ_x β_2, ..., Σ_x β_m, where Σ_x is the covariance matrix of x. For the details of the theory, we refer to Li (1991).
The estimates of β_1, β_2, ..., β_m can be obtained from the eigenvalue decomposition of Σ_η with respect to Σ_x:

Σ_η β_i = λ_i Σ_x β_i,

where λ_i is the i-th largest eigenvalue and β_i is the corresponding eigenvector. For detailed statistical theories justifying the inverse regression view, readers are referred to Li (1991), Duan and Li (1991), Zhu and Fang (1996) and Chen and Li (1998).
We call the eigenvector β̂_1 the first SIR direction and Xβ̂_1 the first SIR variate, Xβ̂_2 the second SIR variate, and so on. After finding a good e.d.r. space, we can project the data into this smaller space and then estimate the response surface by applying smoothing techniques to the projected variables. For the sake of simplicity, the models we estimate are assumed to be linear regression models based on the new input variables.
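The SIR estimation just described (standardize x, slice on the sorted response, eigen-decompose the covariance of the slice means) can be sketched as follows; the function name and toy data are ours, not from the paper.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Minimal SIR sketch: standardize X, slice on sorted y, then
    eigen-decompose the covariance of the slice means."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Inverse square root of the sample covariance (standardization)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ inv_sqrt
    # Slices of roughly equal size along the sorted response
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)          # slice mean of standardized x
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of the slice-mean covariance, back on x scale
    _, vecs = np.linalg.eigh(M)
    return inv_sqrt @ vecs[:, ::-1][:, :n_dirs]

# Toy check (hypothetical data): y depends on x through one direction
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 5))
b_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = (X @ b_true) ** 3 + 0.1 * rng.standard_normal(500)
b_hat = sir_directions(X, y)[:, 0]
cosine = abs(b_hat @ b_true) / np.linalg.norm(b_hat)
```

On this toy problem with a monotone link, the estimated first SIR direction is nearly parallel to the true projection direction.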

Principal components regression (PCR)
When a regression model has a large number of input variables, collinearity between the variables is always a problem. One approach to overcome this difficulty is to transform the original variables into new orthogonal variables, hoping that just a few of them will accommodate most of the variation in the data set. This approach has long been employed in the statistical literature; principal components analysis accords with this idea. It captures the main part of the variation of the original data set X by

z_m = X v_m,

where the vector v_m is the eigenvector associated with the m-th largest eigenvalue of the covariance matrix Cov(X). Once these eigenvectors are found, a model in which y is regressed on z_1, z_2, ..., z_M can serve as a simpler model for prediction. Since the principal components z_m are orthogonal, this regression is just a sum of univariate regressions, with

θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩.

And because the z_m's are linear combinations of the original x^{(j)}'s, we can express the least squares solution in terms of coefficients of the x^{(j)}'s. Note that if M = p, we get back the usual least squares estimates; if M is less than p, the dimension of the original variable space is reduced.
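The PCR computation above can be sketched as follows (synthetic data, illustrative names). As the text notes, taking M = p reproduces the ordinary least squares estimates, which the toy check below confirms.

```python
import numpy as np

def pcr_coefficients(X, y, M):
    """PCR sketch: regress centered y on the first M principal
    component scores of centered X, then map back to the x scale."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows = directions
    Z = Xc @ Vt[:M].T                                  # scores z_1..z_M
    # Components are orthogonal, so each theta_m is a univariate fit
    theta = (Z * yc[:, None]).sum(axis=0) / (Z ** 2).sum(axis=0)
    return Vt[:M].T @ theta                            # slopes on x's

# Toy check (hypothetical data): with M = p, PCR equals least squares
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(50)
beta_pcr = pcr_coefficients(X, y, M=3)
beta_ols, *_ = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)
```

With M < p one simply truncates the sum, trading a little bias for a large reduction in variance when the trailing components carry mostly collinear noise.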

Partial least squares regression (PLSR)
Partial least squares regression (PLSR) is another approach aimed at dimension reduction, introduced by Wold (1975). The idea of PLSR is similar to that of PCA, but with the modification that both y and X are considered in the process; in PCA, only X is used. PLSR begins by computing the univariate regression coefficient γ̂_{1j} of y on each x^{(j)}, that is, γ̂_{1j} = ⟨x^{(j)}, y⟩ / ⟨x^{(j)}, x^{(j)}⟩. When x^{(j)} is standardized to have mean 0 and variance 1, this simplifies to γ̂_{1j} = ⟨x^{(j)}, y⟩. The quantity z_1 = Σ_j γ̂_{1j} x^{(j)} is called a derived input; it is the first partial least squares variate. From this expression, it can be seen that each input is weighted by the strength of its univariate effect on y.
If we use this first new feature to fit the response y, the least squares estimate θ̂_1 on z_1 can be found. We then orthogonalize x^{(1)}, x^{(2)}, ..., x^{(p)} with respect to z_1 to obtain a new set of input variables, from which the second PLSR variate is produced in the same way. The process is repeated until M ≤ p PLSR variates have been computed.
We thus obtain a sequence of derived inputs z_1, z_2, ..., z_M. If the model uses all of z_1, z_2, ..., z_p, it reduces to the usual least squares estimates. Algorithm 2 gives the process and Chart 1 is the flow chart of the PLSR multi-linear regression model.
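The PLSR iteration just described can be sketched as below (variable names are ours; the data is synthetic). The toy check confirms the claim above that with M = p the PLSR predictions coincide with ordinary least squares.

```python
import numpy as np

def pls_fit_predict(X, y, X_new, M):
    """PLSR sketch: derive M components, orthogonalizing the inputs
    against each derived input z_m in turn."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    Xc = (X - x_mean) / x_std                 # standardized training inputs
    Xn = (X_new - x_mean) / x_std             # same transform for new data
    y_new = np.full(len(X_new), y.mean())
    r = y - y.mean()                          # current residual
    for _ in range(M):
        gamma = Xc.T @ r                      # univariate strengths
        z = Xc @ gamma                        # derived input z_m
        zn = Xn @ gamma
        theta = (z @ r) / (z @ z)             # least squares on z_m
        y_new += theta * zn
        r = r - theta * z
        # Orthogonalize each remaining input with respect to z_m
        proj = (Xc.T @ z) / (z @ z)
        Xc -= np.outer(z, proj)
        Xn -= np.outer(zn, proj)
    return y_new

# Toy check (hypothetical data): with M = p, PLSR matches least squares
rng = np.random.default_rng(4)
X = rng.standard_normal((80, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(80)
X_new = rng.standard_normal((10, 4))
pred_pls = pls_fit_predict(X, y, X_new, M=4)
b, *_ = np.linalg.lstsq(np.column_stack([np.ones(80), X]), y, rcond=None)
pred_ols = np.column_stack([np.ones(10), X_new]) @ b
```

Stopping at M < p is what gives PLSR its regularizing, dimension-reducing effect.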

Data cleaning
Data cleaning is important in data mining. A lot of preprocessing is needed before analyzing data, such as filling in missing values, smoothing out noise, detecting outliers, deleting unnecessary variables, and so on. According to the characteristics of our chemical data, we carry out the following actions.

Results of dimension reduction methods
For convenience, we use M to denote the number of variables used in the dimension reduction methods. We first use SIR to obtain 92 projection directions, which are used as input variables. Figure 2 plots the test and training error against the number of variables. It shows that the training error decreases as M increases, while the same behavior does not always occur for the test error. The minimal test root mean squared error (RMSE = sqrt(ave(y − ŷ)²)) of SIR is 4.3625 at M = 57, and when M ranges from 30 to 70 the behavior of the test error is more stable. Figure 3 gives the performance of PCR: when M increases to 30, the test error begins to level off until M is about 60, and the minimal test error of PCR is 4.3267. For convenience, we also list the RMSE for the same data in Table 2, and Table 3 presents comparisons of the three dimension reduction methods. The results of Table 2 come from Rücker and Rücker (1999); these are the best results at present using linear regression models. Table 2 includes only the RMSE of training samples, whereas Table 3 gives the RMSE of test samples obtained through leave-one-out cross-validation. Test errors are usually larger than training errors, so our results are comparable with the previous work. The two tables show that the dimension reduction methods we used perform better.
Table 3 gives the relative results of SIR, PCR and PLSR. The three results come from the same model structure, y = f(Xα_1, Xα_2, ..., Xα_m), where f is assumed to be a linear function. The difference among them is the technique that seeks the directions {α_i}, i = 1, ..., m. As is well known, in principal components regression the m-th principal component direction α_m solves

max Var(Xα) subject to ||α|| = 1 and Corr(Xα_i, Xα) = 0, i = 1, ..., m − 1.

The condition Corr(Xα_i, Xα) = 0 ensures that z_m = Xα_m is uncorrelated with all the previous linear combinations. What optimization problem is partial least squares solving? It can be shown that partial least squares seeks directions that have high variance and high correlation with the response, in contrast to principal components regression (Stone and Brooks 1990; Frank and Friedman 1993). That is, the m-th PLS direction α_m solves

max Corr²(y, Xα) Var(Xα) subject to ||α|| = 1 and Corr(Xα_i, Xα) = 0, i = 1, ..., m − 1.

Unlike PCR, the directions {α_m} that PLS finds are not orthogonal, owing to the different criterion. For sliced inverse regression, reversing the roles of y and X, the analogous problem is to find a variable (derivable linearly from the x^{(j)}'s) which is most predictable from y. It is easily shown that the best prediction is E(Xα|y), in general a nonlinear function of y. Thus the most predictable variable is the one which maximizes

Var(E(Xα|y)) / Var(Xα).

From Table 3, we find that the results of SIR and PCR are very similar.
From the theoretical analysis, one might conclude that the results of SIR should be better than those of PCR, because SIR uses the information in the response when constructing new variables. But this superiority of SIR over PCR is not obvious in its performance here, for two reasons. First, this chemical data set has an important characteristic that distinguishes it from other data: most of its topological indices are highly correlated with the boiling points. That is to say, the design matrix already contains most of the information in the response. Under this circumstance, SIR and PCR have essentially the same ability in dimension reduction. Second, all results come from linear models. If we employed a generalized additive model or another nonlinear model, the results of SIR should be superior to those of PCR. Now let us compare the results of PCR and PLSR. PLSR uses fewer components to achieve about the same result as PCR, generally about half as many. This property has been empirically observed for some time and is often considered an argument in favor of the superiority of PLSR over PCR. We can thus fit the data to the same degree of closeness with fewer components, producing more parsimonious models. The superiority of PLSR over PCR can be explained by their different criteria: PCR uses only the input variable information to determine its components, whereas PLSR uses the response values as well. Obviously, the response variable information contributes to the construction of the new inputs, so PLSR can fit the training data and predict the test data with higher accuracy than PCR for the same number of components. In conclusion, PLSR is a satisfactory dimension reduction method for high-dimensional chemical data, and we will make a further study using more general models based on the directions derived by the PLSR technique.

Figure 1: Test and training error as a function of model complexity.

Figure 2: Test and training error of SIR through cross-validation

Figure 3: Test and training error of PCR through cross-validation

Figure 4: (caption not recovered)

SIR algorithm steps (recovered fragment): 1. Standardize the data, where Σ̂_x and x̄ denote the sample covariance matrix and sample mean. 2. Sort the response y = (y_1, y_2, ..., y_n) from the smallest to the largest. 3. Partition the standardized x̃_i's into H slices according to the sorted sequence of y; n_h is the sample size in slice I_h, h = 1, 2, ..., H, and the I_h are chosen so that the n_h are approximately equal. 4. Compute the sample mean of each slice.

Table 2: The previous work