Application of Orthogonal Block Variables and Canonical Correlation Analysis in Modeling Pharmacological Activity of Alkaloids from Plant Medicines

A new kind of orthogonal block variable, derived from subspace projection and canonical correlation analysis, is applied to model the pharmacological activity of alkaloids from plant drugs. The alkaloids are grouped into three cases according to the route of administration: intravenous, intraperitoneal, and subcutaneous injection. The four block variables (families of variables) investigated in this work are the valence molecular connectivity indices, the alpha kappa indices, the E-State indices, and the element counts of the molecules. A regression model containing only a few of the new orthogonal block variables shows significantly better fitting and prediction ability than multiple linear regression (MLR) on the original variables, principal component regression (PCR), and models that select only one or two of the original variable families. The reason might be that the new orthogonal block variables retain almost all of the information in the original variables but are free of collinearity.


Introduction
Herbal medicine (HM) has a therapeutic history of more than a thousand years and still serves many of the health needs of a large part of the world's population. However, existing approaches to quality assessment cannot fulfill the practical requirements for the safety and efficacy of HMs. One reason might be that, unlike a chemically synthesized drug of high purity, an HM and/or an HM formula may consist of hundreds of complex phytochemicals. Modeling first the activity of the individual components of plant drugs, and then studying the synergistic action of these components, might be useful for revealing the mystery of Chinese herbal medicine. Thus, the technique developed in chemometrics known as quantitative structure-activity relationships (QSAR) is used for this task.
The aim of QSAR is to relate the structure of a molecule to a biological activity by means of statistical tools, which can be expressed mathematically as follows (Devillers 1999):

$$A = f(\text{molecular structure}) = f(\text{molecular descriptors}) \tag{1}$$

where $A$ denotes the activity of a chemical component, which is essentially a measured biological value. In order to evaluate the structural similarity and diversity of molecules and/or to build a QSAR model of the form of equation (1), one first needs to obtain suitable numerical molecular descriptors associated with the molecular structure.
In fact, many molecular descriptors are available to describe molecular structures, such as quantum chemical descriptors, physicochemical parameters, and topological indices. For topological indices alone, hundreds of indices (Katritzky et al. 1994) have emerged since 1947 (Wiener 1947). However, this multiplication of descriptors has caused concern in some parts of the scientific community (Balaban and Ivanciuc 1999).
In QSAR research, two very important questions in building QSAR equations are whether the information content of the descriptors is sufficient to describe the molecules, and how much of that information is not "duplicated" by other descriptors. Randić (1991) proposed an orthogonalization method for selecting variables. Xu and Zhang (2001) systematically studied several ingenious methods, such as forward selection, backward elimination, stepwise regression, leaps-and-bounds regression, and genetic algorithms. In many cases the information in the descriptors is not sufficient, and in that situation it is still an open question whether a variable selection method that deletes some variables from the regression is the best choice. The methods in our former study (Du et al. 2002) and in the present work might offer a new way of selecting variables that includes almost all of the information in the original variables while, at the same time, reducing the number of variables.
In the former study (Du et al. 2002), a subspace projection method was proposed to orthogonalize block variables in modeling the relationship between structure and retention index. The regression against retention index showed significant improvement in both the fitting and the predictive ability of the correlation model. Moreover, the quantitative intercorrelation between different block variables of topological indices can also be evaluated by the proposed technique. The basic idea is first to classify the descriptors into different blocks (groups) and then to apply canonical correlation analysis to obtain new variables that represent the different blocks.
Alkaloids from natural sources are important in medical studies. Elbein and Molyneux (1999) reviewed alkaloids isolated from natural sources as inhibitors of glycoprotein processing. Wang and Xie (1999) reviewed the clinical effects of the alkaloids of Chinese Aconitum plants. Correlating the structure of alkaloids with their activity is useful for predicting the activity of other alkaloids, for understanding in depth how changes in chemical structure affect activity, and finally for modifying the original structures to improve activity.
Topological indices have the advantages of simplicity and fast computation (important for large data sets) and have therefore attracted the attention of scientists. More importantly, topological descriptors can explain most of the property being modeled, as shown by several researchers (Basak et al. 1999, and Brown and Martin 1997). The study by Basak et al. (1999) indicates that the easily calculable topostructural and topochemical indices are an effective first choice in QSAR studies. Brown and Martin (1997) conclude that 2D descriptors are better than 3D descriptors in terms of information content. There are many kinds of topological descriptors for modeling the pharmacological activity of drugs (Hu et al. 2003a). In this work, the three most popular families of topological indices are first selected to build the statistical model. The first is the valence molecular connectivity index (Kier and Hall 1976), the second is the alpha kappa index (Kier 1986), and the third is the E-State index (Kier and Hall 1990). In order to describe the heteroatom effect in the investigated alkaloids, element counts are also included as the fourth block variable. The valence molecular connectivity index has been widely applied (Kier and Hall 1986, and Hall and Kier 1991) in modeling the activity of drugs. The kappa index encodes information on the cyclicity, spatial density, centrality of branching, and symmetry of molecules (Kier 1986, and Kier and Hall 1999), and it has been applied in many QSAR studies (Kier 1985, 1997, and Shen 1967). The E-State index (Kier 1986, and Hall and Kier 1999a) is a very successful topological index for modeling the activity of drugs and is discussed in detail in a book (Hall and Kier 1999b). E-State indices have been used in molecular similarity and diversity research and in QSAR studies (Hall et al. 1995, Kellogg et al. 1996, Hall and Vaughn 1997, and Hall and Story 1996). Furthermore, element counts combined with other topological indices have also been used successfully in QSAR studies (Balaban et al. 1992a, and Balaban et al. 1992b).
It is worth noting that no single family of the above-mentioned variables gives satisfactory results when correlated individually with the pharmacological activity of alkaloids from plant drugs. Thus, in the present work, orthogonal block variables derived from subspace projection and canonical correlation analysis are applied to model the pharmacological activity of alkaloids from plant drugs. The regression shows that the results obtained with a few orthogonal block variables, which include almost all of the information in the original descriptors, are much better than those obtained by selecting one or two of the original variable families.

Methodology
In the former study (Du et al. 2002), orthogonal block variables built from families of topological indices or quantum chemical parameters were proposed by applying a subspace-projection method. Only a brief outline of the method is given in the following sections.

Orthogonalization of block variables by subspace projection
A series (or family) of topological indices, rather than an individual index, sharing a similar calculation strategy is often encountered, such as the molecular connectivity indices. A series of descriptors is generally defined so as to account for more molecular structure information with less redundancy. Thus, a series of descriptors may be considered as an ensemble, named a block descriptor (variable), which includes all the individual descriptors in the series. Similar to the orthogonalization of individual descriptors, orthogonal block descriptors (variables) can also be obtained easily. The advantage of using block descriptors is that one may work with only a few block variables instead of many individual variables. The orthogonalization of the block variables is carried out in the following steps:
1. The procedure starts by selecting a block variable, say $X_1$, as the first orthogonal matrix $\Omega_1$. The second orthogonal matrix $\Omega_2$ is obtained through orthogonal projection, that is,

$$\Omega_2 = (I - \Omega_1 \Omega_1^{+}) X_2 \tag{2}$$

where $I$ is the identity matrix and $\Omega_1^{+}$ denotes the Moore-Penrose pseudoinverse of $\Omega_1$.
2. $\Omega_3$, which is orthogonal to both $\Omega_1$ and $\Omega_2$, can be calculated easily by first defining $X_j = [\Omega_1 \; \Omega_2]$ and $X_i = X_3$ and then using the following equation, that is,

$$\Omega_i = (I - X_j X_j^{+}) X_i \tag{3}$$

Similarly, a series of orthogonal matrices $\Omega_1, \Omega_2, \ldots, \Omega_n$ can be obtained.
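The sequential projection described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `orthogonalize_blocks`, the use of the pseudoinverse to form the projector, and the random data are all hypothetical choices made for the example.

```python
import numpy as np

def orthogonalize_blocks(blocks):
    """Make blocks mutually orthogonal by sequential subspace projection.

    blocks: list of (n_samples, k_i) arrays, already in the desired order.
    """
    omegas = [np.asarray(blocks[0], dtype=float)]    # Omega_1 = X_1
    n = omegas[0].shape[0]
    for X_i in blocks[1:]:
        X_j = np.hstack(omegas)                      # [Omega_1 ... Omega_{i-1}]
        # Projector onto the orthogonal complement of span(X_j);
        # the pseudoinverse also copes with a rank-deficient X_j.
        P = np.eye(n) - X_j @ np.linalg.pinv(X_j)
        omegas.append(P @ np.asarray(X_i, dtype=float))
    return omegas

# Hypothetical data: 30 "molecules", three descriptor blocks of 4, 3, and 5 columns.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(30, k)) for k in (4, 3, 5)]
omegas = orthogonalize_blocks(blocks)

# Largest absolute cross-product between columns of different blocks.
max_cross = max(
    np.abs(omegas[i].T @ omegas[j]).max()
    for i in range(3) for j in range(i + 1, 3)
)
print(f"max |Omega_i^T Omega_j| = {max_cross:.2e}")
```

The first block is returned unchanged, so the result depends on the order in which the blocks are supplied.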

Canonical correlation analysis (CCA)
Canonical correlation analysis (CCA) (Mardia et al. 1979) offers a way to establish the maximum correlation between two sets of variables. The aim of CCA is to find linear combinations $Xa$ and $Yb$ such that the correlation between $Xa$ and $Yb$ is maximal; $Xa$ and $Yb$ are called canonical correlation variables. Consider only combinations with unit variance, $v(Xa) = v(Yb) = 1$. If there exist $a_1$ and $b_1$ such that $R(Xa_1, Yb_1) = \max R(Xa, Yb)$, then $Xa_1$ and $Yb_1$ are called the first pair of canonical correlation variables.
After the first pair has been obtained, the second, third, and subsequent pairs can be found step by step. The canonical correlation variables reflect the linear relationship between $X$ and $Y$. Obtaining them amounts to calculating the eigenvalues and eigenvectors associated with the matrix

$$K = V_{XX}^{-1/2} V_{XY} V_{YY}^{-1/2}$$

where $V_{XX}$ and $V_{YY}$ are the covariance matrices of $X$ and $Y$ and $V_{XY}$ is their cross-covariance matrix. Through singular value decomposition of the matrix $K$, $u_i$ and $v_i$ can be obtained by

$$K = \sum_i \lambda_i u_i v_i^{T} \tag{4}$$

where the $\lambda_i$ are the singular values. The canonical correlation variables can be calculated by the formula

$$a_i = V_{XX}^{-1/2} u_i, \qquad b_i = V_{YY}^{-1/2} v_i \tag{5}$$

and then $Xa_i$ and $Yb_i$ are obtained as the $i$-th pair of canonical correlation variables.
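As an illustration of the calculation just described, the following NumPy sketch computes canonical correlation variables through the singular value decomposition of $K$. It is a minimal sketch under assumptions: the helper names `inv_sqrt` and `cca`, the sample covariance estimates with divisor n - 1, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Symmetric inverse square root of a covariance matrix (eigendecomposition)."""
    w, Q = np.linalg.eigh(S)
    keep = w > eps * w.max()                 # drop numerically null directions
    return (Q[:, keep] / np.sqrt(w[keep])) @ Q[:, keep].T

def cca(X, Y):
    """Return weight matrices A, B (columns a_i, b_i) and canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Vxx, Vyy, Vxy = X.T @ X / (n - 1), Y.T @ Y / (n - 1), X.T @ Y / (n - 1)
    Vxx_is, Vyy_is = inv_sqrt(Vxx), inv_sqrt(Vyy)
    K = Vxx_is @ Vxy @ Vyy_is
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    A = Vxx_is @ U                           # a_i = Vxx^(-1/2) u_i
    B = Vyy_is @ Vt.T                        # b_i = Vyy^(-1/2) v_i
    return A, B, s

# Hypothetical data: Y depends linearly on part of X plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Y = X[:, :2] @ np.array([[1.0, 0.5], [0.2, 1.0]]) + 0.5 * rng.normal(size=(50, 2))
A, B, s = cca(X, Y)

# Correlation of the first pair of canonical variables, to compare with s[0].
r1 = np.corrcoef(X @ A[:, 0], Y @ B[:, 0])[0, 1]
print("first canonical correlation:", s[0], "check:", r1)
```

The singular values of $K$ are the canonical correlations, so the correlation of the first pair of canonical variables should agree with the largest singular value.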

Outlines of the calculation procedure
1. Split all the given descriptors into a few subsets, say $X_1, X_2, \ldots, X_n$, each of which comes from the same family of descriptors proposed by the same authors.
3. Orthogonalize the block variables by equation (3). Note that the order of the variables strongly affects the orthogonalization result. Here the "based on $R_i$" approach is used to order the variables. First, pick the block variable in the set $X_1, X_2, \ldots, X_n$ with the maximum correlation coefficient $R$ against the property $y$ as the first orthogonal block variable $\Omega_1$. Then, for the remaining block variables, calculate their orthogonal block variables with respect to $\Omega_1$ by equation (2), and select the one with the maximum $R$ among them as the second orthogonal block variable $\Omega_2$. The third orthogonal block variable $\Omega_3$ is the one, orthogonal to both $\Omega_1$ and $\Omega_2$, that has the maximum $R$ among the remaining ones. The other orthogonal block variables are calculated in the same way.
4. Calculate the canonical correlation variables by using equations (4) and (5). Note that $Y$ is actually the property vector $y$ in the present work. Thus, $b_1$ is a scalar, and there is only one pair of canonical correlation variables for each orthogonal block variable with $y$. The new orthogonal variables $\omega_1, \omega_2, \ldots, \omega_n$, corresponding to the orthogonal block variables $\Omega_1, \Omega_2, \ldots, \Omega_n$, are then used to build the regression model.
5. If necessary, select a few variables with the maximum correlation coefficients $R_i$ to establish the descriptor-property correlation model.
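The steps above can be sketched as follows for a single property vector $y$. The sketch relies on one known simplification: with a single response, the canonical variable of a block is proportional to the least-squares fitted values of $y$ on that block, so no explicit SVD is needed. The function names, the mean-centering, and the synthetic data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def canonical_variate(Omega, y):
    """Unit-variance canonical variable of a block with a single response y."""
    fitted = Omega @ (np.linalg.pinv(Omega) @ y)   # projection of y onto span(Omega)
    sd = fitted.std(ddof=1)
    if sd < 1e-12:                                 # block carries no information on y
        return np.zeros_like(y), 0.0
    omega = fitted / sd
    return omega, float(np.corrcoef(omega, y)[0, 1])

def orthogonal_block_variables(blocks, y):
    """Order blocks by the 'based on R_i' rule; return omegas, R values, and order."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    centered = [np.asarray(B, dtype=float) - np.mean(B, axis=0) for B in blocks]
    n = len(y)
    remaining = list(range(len(blocks)))
    chosen, omegas, Rs, order = [], [], [], []
    while remaining:
        if chosen:                                 # project out the blocks chosen so far
            S = np.hstack(chosen)
            P = np.eye(n) - S @ np.linalg.pinv(S)
        else:
            P = np.eye(n)
        candidates = {i: P @ centered[i] for i in remaining}
        scores = {i: canonical_variate(candidates[i], y) for i in remaining}
        best = max(remaining, key=lambda i: scores[i][1])   # maximum R against y
        chosen.append(candidates[best])
        omegas.append(scores[best][0])
        Rs.append(scores[best][1])
        order.append(best)
        remaining.remove(best)
    return np.column_stack(omegas), np.array(Rs), order

# Hypothetical data: 40 samples, three descriptor blocks, one property vector.
rng = np.random.default_rng(2)
blocks = [rng.normal(size=(40, k)) for k in (3, 2, 4)]
y = blocks[0][:, 0] + 0.5 * blocks[2][:, 1] + 0.3 * rng.normal(size=40)
W, Rs, order = orthogonal_block_variables(blocks, y)

# Regress y on the new orthogonal variables (with an intercept).
design = np.column_stack([np.ones(40), W])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
R2 = 1 - np.sum((y - design @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
print("block order:", order, "R per variable:", Rs, "model R^2:", R2)
```

Because the new variables are mutually uncorrelated, the squared correlation of the final model is simply the sum of the squared $R_i$, which makes it easy to see how much each block contributes.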

Drug data collection
A total of 65 compounds from plant drugs were studied, namely all of the alkaloids with LD50 values for mice reported in the reference (Shakirov et al. 1996). The details of the compounds are listed in Table 1, which is divided into three cases according to the route of injection: intravenous, intraperitoneal, and subcutaneous. The column (NO.) of Table 1 corresponds to the names of the compounds. The activity values (y) and all the numerical descriptors (X) of the compounds are not given here for the sake of brevity. They are available from the corresponding author on request.

Descriptor calculation
In the present work, four series of descriptors are selected. They are the valence molecular connectivity index (Kier and Hall 1976), the alpha kappa shape index ($^1\kappa_\alpha$, $^2\kappa_\alpha$, $^3\kappa_\alpha$, $\Phi$) (Kier 1986), the E-State index (Kier and Hall 1990), and the element counts ($N_C$, $N_O$, $N_N$), respectively. The descriptors are calculated by the heuristic queue notation (H.Q.N.) system (Hu et al. 2003b). The descriptors used in the QSAR studies of the three cases are listed in Table 2. Indices from the same source, such as those proposed by the same author or derived from the same invariants, should be placed in the same block.
In order to give readers an intuitive impression of how the numerical quantifiers of the molecular structures are obtained, an example (No. 3, deoxypeganine) from Table 1 is given to show the procedure. The chemical structure of deoxypeganine is shown in Figure 1. With the help of the structure, the topological indices can be calculated by the definitions listed in the preceding paper (Hu et al. 2003a). It should be noted that the original definitions for hydrocarbons are modified by introducing some chemical parameters for molecules with multiple bonds and/or heteroatoms. An example of the indices of deoxypeganine is listed in Table 2.

Results and Discussion
The aim in QSAR is to use equation (1) to build a model correlating the numerical molecular descriptors with their corresponding activities so as to predict the activities of similar molecules. In general, a linear model is the first choice, since it makes it easy to interpret why the molecules are active. From Table 1 and the above discussion, one can easily see that the number of samples in the present study is rather small, namely 39, 26, and 32 for the three cases, respectively. However, the number of variables included in the model is 26 (see Table 2), which suggests that overfitting might be the most serious problem to be faced in this work.

Correlation by different descriptors
First, we tried to use a single family of molecular descriptors to build the regression model. However, the regression results listed in Table 3 are quite disappointing. The information content of any individual group of variables is not sufficient to obtain satisfactory results. Then, all of the variables were used to model the activities, and the correlation coefficients for the three cases are 0.8781, 0.9993, and 0.9797, respectively. The fitting results seem to be quite good. In order to check the stability of the models, leave-one-out cross-validation is applied to the three cases using the cross-validated root mean square error of prediction (RMSECV) criterion, that is,

$$\mathrm{RMSECV} = \sqrt{\frac{\mathrm{PRESS}}{n}}$$

Limitation of PCR
The results obtained from PCR are shown in Table 5. The correlation coefficients of the individual principal components, from the first to the 25th (Table 5), show that the order of the values of R has nothing to do with the order of the eigenvalues of the principal components. Thus, the number of principal components to be included in the PCR model cannot be chosen reasonably on the basis of the eigenvalues.
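The point that the correlation of a principal component with the property need not follow its eigenvalue can be illustrated with a small synthetic example. The sketch below is hypothetical throughout: the data are random, the response is deliberately built from the lowest-variance column, and the helper `pc_scores` is an illustrative name. It does not reproduce the values in Table 5.

```python
import numpy as np

def pc_scores(X):
    """Principal component scores and eigenvalues of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s ** 2 / (X.shape[0] - 1)      # sorted in descending order
    return U * s, eigvals                    # column k of U * s holds the scores of PC k+1

# Hypothetical descriptors with very different variances; the response is
# driven by the lowest-variance column plus a little noise.
rng = np.random.default_rng(3)
scales = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
X = rng.normal(size=(40, 5)) * scales
y = 3.0 * X[:, 4] + 0.1 * rng.normal(size=40)

T, eigvals = pc_scores(X)
R = np.array([np.corrcoef(T[:, k], y)[0, 1] for k in range(T.shape[1])])
for k, (ev, r) in enumerate(zip(eigvals, R), start=1):
    print(f"PC{k}: eigenvalue = {ev:8.3f}, |R| with y = {abs(r):.3f}")
```

In a construction like this, a component with a small eigenvalue can carry most of the correlation with y, so ranking components by eigenvalue alone gives no guidance on which ones to keep in the regression.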
Commonly in chemometrics, leave-one-out cross-validation is adopted to choose the right number of principal components; the results are shown in Figure 2 for the three cases. The plots show several minima in the curves, which makes the choice of the right number of principal components very difficult. For instance, for the Sc32 case, the R and s at the minima at five and seven principal components are 0.6809 and 263.3081, and 0.7682 and 230.1728, respectively, which are clearly unacceptable to chemists.
Since the molecular descriptors come from four different families, they can be grouped into four blocks, and all the blocks are then replaced by new orthogonal block variables with the help of canonical correlation analysis. The four orthogonal block variables are then used to build the regression model through the "based on $R_i$" approach described in the Methodology section. The regression results obtained by the method proposed in this work are shown in Table 6. One can see that the correlation coefficients, standard errors, and F tests are quite satisfactory. In order to check the stability of the model, leave-one-out cross-validation is also adopted. The RMSECV values are quite close to the standard errors of the model, which indicates that the model is not overfitted and that its prediction ability is also quite good. All of this shows that orthogonal block variables obtained by subspace projection and canonical correlation analysis may offer a new method for reconstructing variables, and that the method proposed in this paper might have a promising future in QSAR research and data mining in chemistry.

Figure 1. Molecular skeleton and numbering of atoms of deoxypeganine

Figure 2: Relationships of cross-validation (leave-one-out) vs the number of principal components for the three cases.

Table 1: Active compounds from plant drugs and their biological activities

Table 2: The topological descriptors and their corresponding values for deoxypeganine

In the RMSECV criterion, n is the number of observations and PRESS is the predicted residual sum of squares. The results are collected in Table 4. From the table, one can easily conclude that the prediction ability of the MLR models is very poor, with RMSECV values of 238.3407, 1.3676 × 10^3, and 1.3777 × 10^3 for the three cases, respectively. This means that the MLR models are clearly overfitted. In order to remedy such situations in QSAR research, chemists usually resort to principal component regression (PCR) and partial least squares (PLS), developed in chemometrics, since these techniques can reduce the dimension of the variable space efficiently.
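The leave-one-out RMSECV criterion described above can be sketched as follows for an ordinary MLR model. This is a minimal sketch under assumptions: it uses the usual definition RMSECV = sqrt(PRESS / n), the function name `rmsecv_loo` is hypothetical, and the data are synthetic rather than the alkaloid data of this work.

```python
import numpy as np

def rmsecv_loo(X, y):
    """Leave-one-out RMSECV of a least-squares linear model with intercept."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                          # leave sample i out
        design = np.column_stack([np.ones(n - 1), X[mask]])
        coef, *_ = np.linalg.lstsq(design, y[mask], rcond=None)
        pred = np.concatenate(([1.0], X[i])) @ coef       # predict the left-out sample
        press += (y[i] - pred) ** 2                       # accumulate PRESS
    return np.sqrt(press / n)

# Hypothetical data: 30 samples, 3 descriptors.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=30)
value = rmsecv_loo(X, y)
print("RMSECV:", value)
```

Comparing this value with the standard error of the fitted model gives a quick indication of overfitting: a cross-validated error much larger than the fitting error suggests that the model does not generalize.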

Table 6: Regression results with orthogonal block variables