The Classification Tree Combined with SIR and Its Applications to Classification of Mass Spectra

A new approach combining the classification tree (CT) with sliced inverse regression (SIR) is proposed and applied to the classification of mass spectra. The classification tree has been widely used to generate classifiers from mass spectral data because of its power in automatic variable selection and automatic interaction detection. However, it is weak at representing linear and global relationships among variables: when variables enter a model as a linear combination, the classification tree cannot detect that form, which leads to low accuracy. SIR is an effective method for finding useful linear combinations of predictor variables for regressing the response variable, so merging CT and SIR can inherit the advantages of both. Experiments in the paper show that the proposed approach improves the classification accuracy of the decision tree and obtains better results than other classical classification methods.


Introduction
Mass spectrometry, an instrumental technique used to characterize and identify chemical compounds, produces large amounts of valuable data for chemical structure elucidation. In a mass spectrometer, molecules of the investigated substance are ionized, the produced ions are separated according to their mass-to-charge ratio (m/z, mostly z=1), and the abundances of the separated ions are measured. A mass spectrum, a bar plot of m/z versus ion abundance (peak height), is then obtained. At present, the NIST 98 MS Database contains about 100,000 mass spectra with their corresponding chemical structures, and the Wiley/NBS collection includes even more, about 130,000 mass spectra (NIST 98 MS Database and Wiley/NBS collection). How can the chemical structure information contained in these mass spectra be extracted? How can compounds be identified and structural properties recognized automatically from mass spectral data? These are still open problems in chemometrics and have attracted many authors.
One group of strategies widely applied to the above problems comprises classification methods based on multivariate statistics and data mining. Taking peak heights at integer mass-to-charge ratios as input variables and chemical structure properties, for example the presence or absence of a certain chemical substructure, as response variables, classification methods can be used to find relationships between structure properties and mass spectra, or to train classifiers for automatic recognition of structural properties. Many kinds of classification techniques have been applied in this field, such as K-nearest neighbor classification, linear discriminant analysis, principal component analysis, neural networks, and classification trees. K-nearest neighbor (KNN) is a local averaging method. Linear discriminant analysis (LDA) and principal component analysis (PCA) are both globally linear methods. The neural network (NN) is a learning method based on artificial intelligence. Among these techniques, the classification tree (CT) is a tree-based method that is effective in capturing local character.
However, not all methods are effective, because of the complicated character of mass spectral data. First, the dimension of a typical piece of the data is very high. In the development of classifiers, the generation of suitable spectral features is evidently a crucial step (Werther et al. 1994). Therefore, the variables considered consist not only of the original peak heights but also of a set of additional appropriate spectral features. In order not to miss important information in the classification procedure, it is common, just to stay on the safe side, to consider all kinds of spectral features in the development of classifiers. This brings the number of variables in some such studies up to several hundred (Werther et al. 2002). As a result, the common "curse of dimensionality" (Bellman 1961) in data mining also occurs for mass spectral data. Secondly, complicated and widely unknown relationships between mass data and chemical structure are known to exist, and various interactions may also be present in the data. These interactions are even more difficult to model. For these reasons, many of the traditional classification methods are inefficient here.
In view of these characteristics, a classification algorithm designed to develop classifiers for mass spectral data should have two features: first, it should effectively select the variables that are important for classification; second, it should take various interactions into account during the modeling process. Although KNN, LDA and PCA are very useful classification methods in general, their applications to mass spectral data leave something to be desired. In many case studies, however, the classification tree has been shown to be a powerful classification technique satisfying the above two demands. In particular, the approach proposed by Breiman et al. (1984) has been used to generate classifiers in the analysis of mass spectral data. Moreover, its comprehensibility makes it more attractive than neural networks, which usually provide results that are difficult to explain with subject matter knowledge.
It is known that the classification tree is weak at capturing classifiers that are linear functions of the variables. In a typical mass spectral classification exercise, combinations of original peak heights are very likely to play an important role. For example, one of the numerical transformations successfully used for mass spectra classification is the summation of intensities at masses differing by a multiple of 14, which in most cases corresponds to a CH2 group (Crawford and Morrison 1968). This transformation, called the Modulo-14 summation feature, is simply a linear combination of the variables. Unfortunately the classification tree cannot detect this feature, because a classification tree is typically based on a "divide and conquer" approach in which the most useful single variable is selected to split the space into a few (mostly 2) regions at every step. A natural extension is to add new variables, e.g., using certain transformations. An important class of such transformations consists of linear combinations of the original variables. There are, however, technical considerations when we add these linear combinations as candidate predictors; for example, how to find useful combinations and how to add the new variables into the tree model. In this paper, we propose an approach that combines the classification tree with a suitable non-linear regression technique called sliced inverse regression (SIR). The new method recursively adds new variables produced by SIR into the set of predictors during tree induction. SIR (Li 1991) is an effective method in nonlinear regression which extracts principal components from high-dimensional data. SIR has been applied successfully in many fields, but it is seldom used in chemometrics. Experiments in this paper show that our new approach can effectively improve the prediction accuracy of the classification tree on mass spectral data.
The paper is organized as follows. In Section 2 we briefly introduce the decision tree and SIR and propose our approach. Mass spectra and mass spectral data are described in Section 3. Two spectral data sets are studied in Section 4 to illustrate our procedure. The last section concludes that our approach can improve the performance of the original classification tree.

Classification tree
The classification tree has been popular in the statistical community since the book of Breiman et al. (1984) was published. The method separates the sample by recursively splitting the data space until the stopping criteria are all satisfied (Friedman 1991). Suppose a data set consists of p inputs x_1, ..., x_p and one response y for each of N observations, that is, (x^(i), y^(i)) = (x_1^(i), ..., x_p^(i), y^(i)), i = 1, ..., N. In the binary partition process, the classification tree algorithm first creates a node and finds a splitting rule for the given data set. The splitting rule, consisting of a split variable x_j and a split value λ, assigns observations to either the left or the right branch of the node: observations with {x_j < λ} are assigned to the left branch and those with {x_j > λ} to the right. The data space is thus separated into two sub-regions. The same step is then repeated in each subspace until all the stopping criteria are satisfied. Extensive details of the algorithm can be found in Breiman et al. (1984).
There are many criteria based on different measures of node impurity, such as misclassification error, the Gini index and the deviance, for determining the choice of splitting rules in a classification tree (Hastie et al. 2001). The criterion used in this paper is minimizing the deviance. Suppose the output variable y takes values 1, ..., K, and a node m, representing a region R_m, contains N_m observations (x^(i), y^(i)). The deviance impurity measure D_{R_m} is defined as

    D_{R_m} = -2 Σ_{k=1}^{K} N_m p_k log p_k,    (1)

where p_k is the proportion of class k observations in node m, i.e.

    p_k = (1/N_m) Σ_{x^(i) ∈ R_m} I(y^(i) = k),    (2)
where I(·) is the indicator function. Then at each node, representing a region R, the classification tree performs a greedy search for the splitting variable x_j and split value λ by solving

    (x_j, λ) = argmin_{(x_j, λ)} (1/N) [ D_{R_1(x_j, λ)} + D_{R_2(x_j, λ)} ],    (3)

where N is the number of observations contained in region R, i.e. N = |R|, and R_1(x_j, λ) = {x ∈ R : x_j < λ}, R_2(x_j, λ) = {x ∈ R : x_j > λ} are the two sub-regions obtained by using the split pair (x_j, λ). The decision tree is conceptually simple yet powerful. However, it has difficulty capturing linear structure among the variables. If linear combinations of the predictors separate the classes from the start, the classification tree will yield a weak classifier, because the effect of an error caused by a single variable in the top split propagates down to all of the splits below it. Adding suitable linear combinations of the variables should therefore improve the predictive power of the tree.
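To make the greedy split search concrete, the following sketch (our own illustrative Python, not the authors' code; the function names are ours) scores every candidate (variable, threshold) pair by the summed child deviance, as in equation (3):

```python
import numpy as np

def deviance(y, n_classes):
    """Node deviance D_R = -2 * sum_k n_k * log(p_k), over classes present."""
    n = len(y)
    d = 0.0
    for k in range(n_classes):
        n_k = np.sum(y == k)
        if n_k > 0:
            d += -2.0 * n_k * np.log(n_k / n)
    return d

def best_split(X, y, n_classes):
    """Greedy search over all (variable, threshold) pairs minimizing
    the summed deviance of the two child regions, normalized by n."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for lam in np.unique(X[:, j])[1:]:      # candidate thresholds
            left = X[:, j] < lam
            if left.sum() == 0 or left.sum() == n:
                continue
            score = (deviance(y[left], n_classes)
                     + deviance(y[~left], n_classes)) / n
            if score < best[2]:
                best = (j, lam, score)
    return best
```

On perfectly separable one-dimensional data, the search returns the threshold that yields two pure children with zero deviance.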

Sliced inverse regression
In this subsection, we review the basic idea of sliced inverse regression (SIR) and present its computational algorithm, as introduced by Li (1991). SIR is an effective method in non-linear regression, since it combines the good parts of principal component analysis and inverse regression with the information relating the dependent and independent variables.
Suppose X is an n × p matrix with n observations and p variables, with rows x^(i) = (x_1^(i), ..., x_p^(i)), and y is an n × 1 univariate output variable.
Li (1991) introduced the model

    y = g(β_1' x, ..., β_K' x, ε),

where g is an unknown function, β_1, ..., β_K are K unknown projection directions, and the random error ε is independent of X with unknown probability distribution. Thus y is a function of K linear combinations of x_1, ..., x_p. The K directions are used to generate the first K principal components; they are called the effective dimension reduction (e.d.r.) directions and are our primary interest.
Based on inverse regression, SIR reverses the roles of y and X, treating y as if it were the independent variable and X as if it were the dependent variable. It has been proved that if X is standardized to Z, the standardized inverse regression curve E(Z|y) is contained in the linear space spanned by the standardized e.d.r. directions β_1, ..., β_K. For details of the theory, readers are referred to Li (1991). The covariance matrix Σ_η = Cov(E(Z|y)) is degenerate in any direction orthogonal to the β_k's. Therefore, the normalized eigenvectors β_k (k = 1, ..., K) associated with the K largest eigenvalues of Σ_η are the standardized e.d.r. directions. In this approach, Σ_η is estimated by the slice method, as detailed in the following algorithm of sliced inverse regression:
1. Standardize X to Z: z^(i) = Σ̂_x^{-1/2}(x^(i) - x̄), where x̄ and Σ̂_x are the sample mean and covariance of X.
2. Sort the data y = (y^(1), ..., y^(n)) from the smallest to the largest.
3. Divide the data set into H slices I_1, ..., I_H according to the order of y; n_h is the sample size in slice I_h, h = 1, 2, ..., H.
4. Within each slice, compute the sample mean, i.e., z̄_h = n_h^{-1} Σ_{y^(i) ∈ I_h} z^(i).
5. Compute the sample covariance matrix of the slice means, weighted by the slice sizes: Σ̂_η = n^{-1} Σ_{h=1}^{H} n_h z̄_h z̄_h'.
6. Find the eigenvalues λ̂_1 ≥ ... ≥ λ̂_p of Σ̂_η and the associated eigenvectors β̂_1, ..., β̂_p. The ith eigenvector β̂_i is called the ith SIR direction.
7. Project X along the first K SIR directions to form principal components.
Similar to principal component analysis, SIR keeps the first K (K < p) directions, where K is often determined by the cumulative contribution rate of the eigenvalues, Σ_{i=1}^{K} λ̂_i / Σ_{i=1}^{p} λ̂_i. In practice, K is commonly chosen so that the cumulative contribution rate reaches 80% ∼ 90% or more. Note that there are some differences between principal component analysis (PCA) and SIR. SIR constructs the new explanatory variables using the information in both X and y, whereas PCA's new variables are determined only by the values of X. The directions found by SIR are the ones along which y changes quickly.
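As an illustrative sketch (our own Python, not the paper's implementation), the slicing algorithm above can be written as follows, where `n_slices` and `n_dirs` correspond to H and K; the covariance is assumed nonsingular (the preprocessing in Section 4 removes near-constant variables for this reason):

```python
import numpy as np

def sir_directions(X, y, n_slices, n_dirs):
    """Sliced inverse regression (Li 1991): standardize X, slice on the
    sorted y, and eigendecompose the weighted covariance of slice means."""
    n, p = X.shape
    # Step 1: standardize X to Z
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)               # assumes cov nonsingular
    cov_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = (X - mu) @ cov_inv_sqrt
    # Steps 2-3: sort by y and cut into H slices
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Steps 4-5: weighted covariance of the slice means of Z
    sigma_eta = np.zeros((p, p))
    for idx in slices:
        zbar = Z[idx].mean(axis=0)
        sigma_eta += (len(idx) / n) * np.outer(zbar, zbar)
    # Step 6: leading eigenvectors are the standardized e.d.r. directions
    evals, evecs = np.linalg.eigh(sigma_eta)
    dirs = evecs[:, ::-1][:, :n_dirs]              # largest eigenvalues first
    # Map back to the original scale so that beta' (x - mu) is usable
    return cov_inv_sqrt @ dirs, evals[::-1]
```

When y depends on a single linear combination of X, the leading SIR direction recovers (approximately) that combination, and the remaining eigenvalues are close to zero.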
In classification, H, the number of slices, equals the number of groups, and each group forms a slice. We can then find linear combinations of the original variables by the above algorithm. Obviously, the new variables can be used directly in the construction of a decision tree. Based on the above discussion, we propose the following method combining the classification tree and SIR.

Approach combining classification tree and SIR
In our approach the vector of predictors is expanded by the principal components found by SIR. However, if the new variables are added to the decision space only prior to tree induction, the defects of the subtree discussed earlier still exist in each subspace after the decision space is split. We therefore add new attributes recursively: after the original region is split into two sub-regions, we apply SIR to the current data set in each sub-region to find new principal components and add them to the vector of predictors for the next selection of a splitting rule.
Given a data set as in subsection 2.1, the algorithm of the Classification Tree combined with SIR (CTS) consists of six steps:
1. Assume the current node m, representing the region R_m, contains N_m samples. Apply SIR to the N_m samples and find the principal components to form the new variables.
2. Add the new variables to the set of predictors. (Let p be the number of predictors at this step.)
3. Generate the candidate splitting value set {λ_ij} for each predictor x_j as the sequence of ordered values obtained by sorting the samples according to the value of x_j.
4. Select the splitting rule (x_j, λ) using equation (3) from the candidate splitting pair set.
5. Generate two child branches and nodes to represent R_m1 and R_m2, the sub-regions obtained by splitting the region R_m.
6. Calculate the deviance of the two sub-regions and record the number of samples in each. If D_{R_m1} ≤ α or D_{R_m2} ≤ α, the corresponding branch is terminated as a leaf. If N_m1 ≤ β or N_m2 ≤ β, the corresponding branch is also terminated as a leaf. Each leaf is assigned to the class k with the maximum proportion p_k (defined in equation (2)). Otherwise, repeat Steps 1-5 for each child node until the deviance of each child node is less than α or the number of samples in each child node is less than β.
Usually α = 0.05 ∼ 0.1 and β = 5. A smaller α or β will generate a bigger tree with higher accuracy in training but can easily lead to over-fitting.
In this approach, the SIR directions found at every step must be saved, so that the samples to be predicted can be projected along the same directions to produce the new variables during prediction. The samples can then be classified according to the trained tree. We have successfully applied this approach to the classification of mass spectra, as shown in Section 4.
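A minimal sketch of the CTS recursion follows (our own simplified Python, not the authors' program): it assumes two classes with one slice per class, in which case the single SIR direction coincides, up to scale, with Σ⁻¹(x̄₁ − x̄₀); a small ridge term is added to the covariance for numerical stability, and each node augments the predictors with its own SIR projection before the split search:

```python
import numpy as np

def two_class_sir_direction(X, y):
    """With one slice per class, the single SIR direction reduces
    (up to scale) to inv(Sigma) @ (mean_1 - mean_0)."""
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # ridge
    d = np.linalg.solve(cov, X[y == 1].mean(0) - X[y == 0].mean(0))
    return d / np.linalg.norm(d)

def node_deviance(y):
    n = len(y)
    d = 0.0
    for k in np.unique(y):
        nk = (y == k).sum()
        d += -2.0 * nk * np.log(nk / n)
    return d

def grow_cts(X, y, alpha=0.1, beta=5):
    """Recursive CTS induction: at each node, append the SIR projection
    as an extra predictor, then pick the (variable, threshold) pair
    minimizing the summed child deviance."""
    if node_deviance(y) <= alpha or len(y) <= beta or len(np.unique(y)) == 1:
        return {"leaf": int(np.bincount(y).argmax())}
    d = two_class_sir_direction(X, y)
    Xa = np.column_stack([X, X @ d])          # augmented predictor matrix
    best = (None, None, np.inf)
    for j in range(Xa.shape[1]):
        for lam in np.unique(Xa[:, j])[1:]:
            left = Xa[:, j] < lam
            if 0 < left.sum() < len(y):
                s = node_deviance(y[left]) + node_deviance(y[~left])
                if s < best[2]:
                    best = (j, lam, s)
    j, lam, _ = best
    left = Xa[:, j] < lam
    return {"split": (j, lam), "dir": d,
            "left": grow_cts(X[left], y[left], alpha, beta),
            "right": grow_cts(X[~left], y[~left], alpha, beta)}
```

The stopping thresholds default to the values suggested above (α = 0.1, β = 5). For prediction, the `dir` stored at each internal node plays the role of the saved SIR directions.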

Mass spectrum
The mass spectrum is one of the most important spectra in chemometrics. It is the graphical representation of the mass-to-charge ratio (m/z, mostly z=1) of separated ions versus their abundance (peak height). For example, Figure 1 shows a mass spectrum of C2H4O, acetaldehyde. A mass spectrum is produced by the instrumental technique of mass spectrometry, which is commonly used to characterize and identify organic compounds. The distribution of peaks in a mass spectrum is very characteristic for a compound, although not unique. The main information obtained from a mass spectrum is the molecular mass, together with hints about the substance class and parts (substructures) of the molecular structure. In our work, mass spectra and chemical structures are taken from the same mass spectral library as in (Liang and Gan 2001). This library was established by transferring the NIST62 mass spectrum library, which is built into the GCMS-QP5000 of Shimadzu. The aim of spectra evaluation is either to find the relationship between mass spectra and chemical structures or to identify structural properties of an unknown compound from its spectrum. In the classification of mass spectral data, therefore, the heights of the peaks at every m/z, or spectral features (defined in the next subsection), are commonly taken as the input variables x_1, ..., x_p, and chemical structure properties, for example the presence or absence of a certain chemical substructure, as response variables.

Spectral feature
In the following experiments, the vector of predictors includes not only the original peak heights of the mass spectra but also a set of features obtained by transforming the original peaks. These spectral features are included for the following reasons:
1. As shown in earlier work, appropriate spectral features relate more simply to molecular structures than the original peak heights of mass spectra (Lohninger and Varmuza 1987; Varmuza 1998).
2. Although the tree structure can detect interactions of variables and SIR can find useful linear relationships, some simpler relationships are still not represented.
3. The decision tree has a powerful ability in variable selection. If a spectral feature is irrelevant, it cannot influence the model much. On the contrary, if a useful feature is added during tree induction, the accuracy will be improved and the size of the tree will be reduced.
Let I_m be the intensity of a peak at mass-to-charge ratio m/z. In the following, the mass-to-charge ratio m/z is abbreviated as mass m. The spectral features used are listed below.

A. Intensities at a single mass normalized to local ion current
Features of this group emphasize isolated peaks, even those possessing only low intensities (Erni and Clerc 1972):

    A_m = 100 · I_m / Σ_{j=m-Δm}^{m+Δm} I_j.
C. Modulo-14 summation

One of the first numerical transformations successfully used for mass spectra is the summation of intensities at masses differing by a multiple of 14, which in most cases corresponds to a CH2 group (Crawford and Morrison 1968). These features are obtained by

    M_j = Σ_n I_{j+14n},  j = 1, ..., 14.

D. Autocorrelation

This feature characterizes mass differences between peaks as well as periodicities in a spectrum (Wold and Christie 1984).
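The Modulo-14 summation is straightforward to compute. A sketch (our own illustrative Python, assuming the spectrum is stored as a vector of intensities indexed by mass starting at m/z = 1):

```python
import numpy as np

def modulo14_features(intensities):
    """Sum peak intensities whose masses are congruent modulo 14,
    yielding 14 features; masses differing by a multiple of 14 often
    correspond to CH2 repeat units."""
    feats = np.zeros(14)
    for m, i_m in enumerate(intensities, start=1):  # mass starts at m/z = 1
        feats[m % 14] += i_m
    return feats
```

For example, peaks at masses 15 and 29 (both ≡ 1 mod 14) accumulate in the same feature.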

Experiment
In this section we apply the approach proposed in subsection 2.3 to mass spectral data.Two case studies are discussed.

Data preprocessing
In practice, at every step of finding the SIR directions, variables whose range over all samples is less than 1 are removed from the SIR computation, in order to avoid the numerical problem caused by inverting the covariance matrix. Mass spectral data are inexact and influenced by noise. Although we do not know the exact level of the noise, it is sensible to assume it is not less than 1, so a variable with range less than 1 can be regarded as useless for classification. Even if such a variable does contribute to the classification, it can still be detected by the decision tree, because we do not discard it in the process of tree induction.
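This screening step can be sketched as follows (our own illustrative Python; the threshold of 1 follows the noise-level argument above):

```python
import numpy as np

def drop_low_range(X, threshold=1.0):
    """Keep only variables whose sample range is at least the assumed
    noise level, avoiding near-singular covariance matrices when SIR
    inverts the covariance; returns the filtered matrix and the kept
    column indices so results can be mapped back to original variables."""
    keep = (X.max(axis=0) - X.min(axis=0)) >= threshold
    return X[:, keep], np.flatnonzero(keep)
```

Returning the kept indices matters because, as noted for the appendices, the index i of x_i always refers to the position among the original variables.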

Example 1
We use a simple example to illustrate the approach proposed in this paper and to compare its results with other classification methods. Alcohols and ethers are isomeric compounds. From the library NIST62, we select all 201 compounds that belong to the two groups and whose molecular formulas range from C5H12O to C9H20O. The number of alcohols is 148 and that of ethers is 53. From these 201 compounds, we select 141 at random as the training sample and the remaining ones as the testing sample. We take the heights of the peaks at masses from 1 to 144 as the original variables, because the molecular weight of C9H20O is 144. Denote the alcohols by "0" and the ethers by "1". We describe the process of the new method as follows.
First, as mentioned in subsection 4.1, 79 variables are obtained in the training sample by removing the variables with ranges less than 1. These variables are used to find the SIR direction. In this example, we select only the first SIR direction β to produce the new variable, because the first eigenvalue's contribution rate is greater than 90%. Let Xβ, with β the first direction selected, be the 145th variable; here, to fix ideas, the design matrix X has been standardized. The splitting rule based on all 145 variables uses x_145 at the first node, and the value 2.905 is chosen as the split value. The whole space is divided into two subspaces according to this splitting rule. At "x_145 > 2.905", the algorithm has reached the stopping criterion, and in this subspace, since the proportion of class "0" is greater than that of class "1", i.e. p_0 > p_1 (defined in equation (2)), the terminal node is set to "0". For the case "x_145 < 2.905", the algorithm goes one step further and finds the SIR direction for the current data set. A new variable, x_146, is added to the decision space, and the splitting pair (x_146, 0.3566) is found. Here, the branch "x_146 < 0.3566" reaches the stopping boundary and is terminated as leaf "1". Finally, for the case "x_145 < 2.905 & x_146 > 0.3566", the same process is applied, and all branches reach the stopping boundary. The full tree is shown in Figure 2(a). The directions found by SIR at every step are shown in Appendix 1.
In prediction, we first apply the same transformations, using the saved SIR directions, to the test samples to generate the additional predictor variables. The first splitting rule is then used to divide the 60 testing samples. The result is shown in Figure 2(b). The 38 samples that fall into the subspace "x_145 > 2.905" are predicted as "0"; here, one testing sample is predicted incorrectly. The remaining 22 samples are separated by the splitting pair (x_146, 0.3566), as shown in Figure 2(c). Eleven fall into the branch "x_146 < 0.3566" and are therefore predicted as "1"; all 11 are predicted correctly. Finally, the third splitting rule is applied to the remaining 11 samples. From Figure 2(d), we can see that the third splitting rule does not act in this situation, because the values of the 33rd variable of all 11 samples are less than 0.3. One sample is misclassified.
We use leave-one-out cross validation to compare the predictive ability of our approach, CTS, with that of CT and PCA. The result is shown in the corresponding column of Table 1, where the result of the classification tree is calculated by the program in S-plus 2000. It can be observed that the predictive ability is improved by our approach. In order to find the relationship between the substructure and mass spectra, the classification tree obtained with CTS using all 201 training samples is shown in Figure 3(a), where 4 decision nodes are used to classify the two groups. The direction found by SIR at every step is shown in Appendix 2. From the first SIR direction, used to form x_145, we observe that the variables x_27, x_39, x_41, x_44, x_51, x_53, x_55, x_67, x_69 contribute most, and x_19, x_33, x_47 take second place. Moreover, (x_27, x_41, x_55, x_69), (x_39, x_53, x_67) and (x_19, x_33, x_47) all correspond to CH2 groups, because the masses within each group differ by 14 or multiples of 14. The values of the SIR direction corresponding to the first two groups are about 0.2, and those corresponding to x_19, x_33, x_47 are about ±0.15. It can be inferred that x_145 mainly reflects a combination of CH2 groups. Likewise, x_146, formed at the second step, also reflects a combination of CH2 groups; in this direction x_33, x_47, x_44, x_53, x_55, x_67 are the most important. In the SIR direction used to obtain x_147, we observe that x_59, x_60, x_29, x_30, x_33, x_39, x_41, x_42 play the important role. The values of this SIR direction corresponding to x_59, x_60 are smaller than -0.2, and those corresponding to x_29, x_30, x_33, x_39, x_41, x_42 are greater than 0.2. If we consider (x_29, x_30, x_33), (x_39, x_41, x_42) and (x_59, x_60) as three local groups, x_147 may reflect the importance of combination groups formed by adjacent peaks.

Example 2
In this experiment, 350 samples of nitrogenous compounds and another 350 samples of non-nitrogenous compounds are selected at random. Their molecular weights range from 100 to 200. To illustrate the power of our approach and to supply a basis for comparison against other methods, two kinds of experiments are considered. In the first case, the predictors are the original peak heights only. In the second case, the predictors include the original peak heights as well as the spectral features. The heights of the original peaks at masses from 1 to 200 are the first 200 variables; the next 85 predictors are feature A (defined in subsection 3.2) for masses 15-100 with Δm = 3; feature B (defined in subsection 3.2) is computed in the mass interval [15, 100] with Δm = 1 and 2, resulting in 170 features; the Modulo-14 summation is performed in the mass interval 15-100, resulting in 14 features; and the mass spectral feature autocorrelation, also defined in subsection 3.2, is calculated for masses 15-100 with Δm = 1, 2 and 14-20, resulting in 9 features. In this way 478 variables are obtained. We compare the predictive ability of our approach with that of CT and PCA using leave-one-out cross validation. The result is shown in the last two columns of Table 1. The CTS tree obtained using all 700 training samples is shown in Figure 3(b). This tree uses only 7 decision nodes, compared with 31 decision nodes produced by CT alone. We do not display the SIR directions, for reasons of space, but they are available from the authors. From the results, we observe that better classification accuracy is obtained by CTS when the appropriate features are incorporated into the predictors. In fact, the improvement in accuracy can be significant, so using variables known to be important from experience and other studies is worthwhile.

Conclusion
In the classification of mass spectra, the classification tree is a very useful tool, because mass spectral data are high dimensional and full of complex interactions. However, the classification tree is weak at representing linear and simpler relationships among variables and global factors. Our approach, which combines the classification tree and SIR, can make up for this defect of the classification tree and shows improvement in predictive power.
Experimental results show that this approach can improve the classification accuracy of the classification tree and reduce the size of the tree. They also show that introducing appropriate features into the attributes can improve the classification ability of the tree structure. Note: Because the variables that may produce numerical problems have been removed at every step, the index i of x_i denotes the position among the original variables.

Figure 1. Mass spectrum of C2H4O.
Figure 2. Result obtained by the approach combining CT and SIR. Note: "." denotes alcohol, "x" denotes ether.
Figure 3. The classification trees obtained by CTS. Note: (a) the first example; (b) the second example.

Table 1 :
The Result of Leave-one-out Cross Validation