Multivariate Chemometric Study on the Interfacial Properties of Nucleic-Acid Bases

Systematic quantitative structure-retention relationship studies of nucleic acid bases were carried out by combining multivariate analysis with experimental chromatography. The results revealed a multiple linear relationship between chromatographic retention and the molecular structural parameters, yielding a regression R² value of 0.8113 (cross-validated Q² = 0.6945). Five molecular descriptors, viz., the moments of inertia (Ix, Iy and Iz), molar volume, and polar surface area, are able to account for the retention behavior of the compounds. Principal component analysis and factor analysis results indicate that the moments of inertia and molar volume have a primary influence on the chromatographic retention. The results provide useful insights for future experimental and theoretical studies on the medicinal research of nucleic-acid base compounds.


Introduction
Nucleic acids, essential in living organisms, are constructed from nucleotides, which in turn are built on five purine or pyrimidine nucleic-acid bases (nucleobases). Nucleobases and their derivatives/analogs are commonly found among designer drug molecules (Coulson 1994). Our main goal here is a systematic QSRR study of pyrimidines, relating their molecular descriptors (variables) to their chromatographic retention (experimental unit). We performed the analysis using various regression approaches, relating retention to size-specific, shape-specific, and polarity parameters. Redundant variables are identified.
The quantitative structure-retention relationship (QSRR) (Kaliszan 1987) between experimental chromatographic retention data and molecular descriptors has been extensively studied for three main reasons: (1) explanation of the mechanism of chromatographic separations, (2) prediction of retention, and (3) characterization of solute physicochemical properties of importance for reactivity and especially for bioactivity (Kindsvater et al. 1974, Roland and Robert 1980). In QSRR studies, molecular descriptors are either determined from experiments or computed by molecular mechanics or even semiempirical quantum chemical techniques.
The choice of suitable molecular descriptors for a specific problem can be guided by anything from the simple empirical theories of chromatography to highly sophisticated ab initio calculations that quantify the relevant intermolecular interactions. In reversed-phase high-performance liquid chromatography (RP-HPLC), it is generally agreed that two types of intermolecular interactions govern solute retention: (1) polar forces resulting from permanent or induced dipoles of the solute, stationary phase, and mobile phase molecules, and (2) nonpolar forces resulting from dispersive interactions. The ability of solutes to undergo dispersive interactions is generally expressed by physicochemical variables that reflect the size and shape of the molecule, such as moment of inertia and surface area. The polarity of solutes is expressed in terms of electronic parameters, such as dipole moment, polar surface area, and ionization potential.
A handful of robust statistical methods are available for QSRR studies, such as Multiple Linear Regression (MLR), Principal Component Analysis and Regression (PCA and PCR), Partial Least Squares Regression (PLSR), and Factor Analysis (FA). PLSR is especially useful when there are more variables than experimental samples. A cross-validated R², commonly referred to as Q², is computed analogously to the conventional R². Factor Analysis (FA) has been used frequently in chromatography for two main purposes (Kindsvater et al. 1974, Roland and Robert 1980). The first is to ascertain how many factors are necessary to account for the variance of the retention data. This useful information can be obtained without having to identify the factors. The second is to interpret the abstract factors in terms of physically meaningful parameters.

Experimental Details: Experiments
The nucleobase standards were prepared in 40% (v/v) methanol/water and chromatographed using an HP 1050 liquid chromatograph equipped with a UV diode-array detector set to 254 nm. The column temperature and the flow rate were kept at 35 °C and 0.5 mL/min, respectively, for the experiments. The retention of the nucleobases was expressed by the logarithm of their capacity factors, log k, determined by RP-HPLC with mobile phases containing different organic modifiers: acetonitrile (ACN), methanol (MeOH), and tetrahydrofuran (THF).
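As a reading aid, the conversion from a measured retention time to log k can be sketched as follows. The sketch assumes the usual chromatographic definition of the capacity factor, k = (t_R − t_0)/t_0, and a base-10 logarithm; neither the dead-time measurement nor the logarithm base is stated in the text, and the function names and numerical values are hypothetical.

```python
import math


def capacity_factor(t_r: float, t_0: float) -> float:
    """Capacity factor k = (t_R - t_0) / t_0 from retention and dead times."""
    if t_0 <= 0:
        raise ValueError("dead time t_0 must be positive")
    if t_r <= t_0:
        raise ValueError("retention time must exceed the dead time")
    return (t_r - t_0) / t_0


def log_k(t_r: float, t_0: float) -> float:
    """Base-10 logarithm of the capacity factor (base 10 is an assumption)."""
    return math.log10(capacity_factor(t_r, t_0))


# Hypothetical times in minutes, not taken from the paper: here k = 2.
print(round(log_k(t_r=6.0, t_0=2.0), 3))
```

A compound that elutes at or before the dead time has no defined log k under this definition, which is why the function rejects such inputs.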

Modeling
Initial geometry optimization of the thirty-two pyrimidine compounds (Figure 1) was done with the molecular mechanics MMX force field using the program PCMODEL, version 7.5 (Schlecht 1998). The dipole moment (DM), polar surface area (PSA), molar volume (MV: molecular volume times the Avogadro constant), and molecular moments of inertia about the three principal axes (Ix ≤ Iy ≤ Iz) of the compounds were also calculated using the same software (Table 1).

Data analysis
Standard multivariate analysis (Johnson 1998, Johnson and Wichern 2002) was used to probe the correlation between the retention parameter log k and the molecular descriptors, using a publicly available QSAR routine (Fedders and Ponder 1996) and data analysis software (Lohninger 1999, Malinowski 2002). Following accepted practice in QSRR studies, the criterion R² ≥ 0.81 for MLR, PCR, and PLSR is employed to decide whether a model is internally self-consistent, and a cross-validated Q² ≥ 0.5 is taken to indicate robustness and the absence of over-fitting, with Q² given by the equation

$$Q^2 = 1 - \frac{\sum_{i=1}^{n} (y_{\mathrm{pred},i} - y_{\mathrm{obs},i})^2}{\sum_{i=1}^{n} (y_{\mathrm{obs},i} - \bar{y}_{\mathrm{obs}})^2}$$

The summation term in the numerator is often referred to as "PRESS" in the statistics literature. The leave-one-out procedure was adopted, which means that each y_pred term in the summation is predicted from the remaining (n − 1) experimental units (the y_obs values), and this is repeated n times until each experimental unit has been left out once. The Q² value for a model with good predictive performance will be close to 1. FA was performed with SAS (2001).
See Figure 1 for the structures of the molecules.
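The leave-one-out Q² described in this section can be illustrated with a minimal sketch in plain Python. It is not the QSAR routine used in the study: the least-squares fit is done with a small normal-equations solver written here for self-containment, all function names are hypothetical, and the demonstration data are made-up placeholders, not values from Table 1 or Table 2.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]  # ZeroDivisionError if singular
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve(ata, aty)


def predict(coef, row):
    """Predicted response for one descriptor row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def q2_loo(X, y):
    """Leave-one-out Q^2 = 1 - PRESS / total sum of squares."""
    n = len(y)
    press = 0.0
    for i in range(n):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (predict(coef, X[i]) - y[i]) ** 2
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / ss_tot


# Placeholder data: 6 hypothetical compounds, 2 hypothetical descriptors.
X_demo = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
y_demo = [0.1, 0.9, -0.4, 0.6, 1.4, -0.5]
print(round(q2_loo(X_demo, y_demo), 4))
```

Because every prediction comes from a model that never saw the left-out compound, Q² can be much lower than R² and can even be negative for a poorly predictive model.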
Table 2. The substitution pattern and the RP-HPLC retention values (log k) of the pyrimidines in two mobile phases. The core framework and atom numbering of the pyrimidine core is: [structure diagram]

Six-variate linear regression
The structures of the thirty-two pyrimidine nucleobases are shown in Figure 1, and their selected retention parameters (log k) are listed in Table 2. QSRR analysis was performed for these pyrimidines using various regression approaches, relating retention to a variety of size- and shape-specific variables (molar volume and moment of inertia) and polarity variables (polar surface area and dipole moment).
In a 20% ACN in water (v/v) mobile phase, compounds 1 and 2 fall outside the experimental retention range. Attempts to correlate log k with all six molecular descriptors for the remaining 30 compounds resulted in a poor linear regression (MLR, R² = 0.3898). Removing six further outliers (8, 9, 21, 30, 31, and 32) from the regression analysis, which leaves 24 compounds, markedly improved the fit (Table 3). The resulting six-variate MLR model (in 20% ACN in water mobile phase) is referred to below as equation (3.1); its coefficients are listed in Table 3.
Table 3: The six- and five-variate MLR and PLSR regression coefficients and the cross-validation Q² values of the RP-HPLC retention models.
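For orientation, a six-variate MLR model in the descriptors named in this section has the general form sketched below. The coefficient symbols b_0 to b_6 and the ordering of the terms are notational assumptions made here; the fitted values are those reported in Table 3 and are not reproduced.

```latex
\log k = b_0 + b_1\,\mathrm{PSA} + b_2\,\mathrm{MV} + b_3\,I_x + b_4\,I_y + b_5\,I_z + b_6\,\mathrm{DM}
```

Dropping a descriptor from the model corresponds to fixing its coefficient at zero and refitting the remaining ones.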

Five-variate linear regression
By applying PCR analysis, the number of principal components was reduced to 5 with a similar correlation (R² = 0.8113), which suggests redundancy in the model. Descriptors were removed one at a time and the resulting correlations were compared. The regression quality remains similar to that of the six-variate model only when the dipole moment is eliminated. The results of the five-variate linear regression without the dipole moment are compiled in Table 3, and the resulting model is referred to as equation (3.2). Comparing equations (3.1) and (3.2), it is clear that the five-variate and six-variate models are almost identical apart from the dipole moment descriptor. Together with the larger F-statistic of the five-variate model, this indicates that the dipole moment has a negligible effect on retention in these models.
Figure 2 plots the retention observed experimentally against that calculated by equation (3.2). The line drawn in Figure 2 is the 45° line through the origin expected for a perfect correlation.
Statistical analysis using the PLSR method was also carried out (Table 3). With 20% ACN in water as the mobile phase, the Q² values for the five- and six-variate models are 0.6915 and 0.6496, respectively. Both exceed the Q² ≥ 0.5 criterion, which supports the validity of the PLSR model, without over-fitting, in predicting the retention values using the five molecular descriptors.
Although some outliers are present in the data, prediction of the retention of the 24 pyrimidines is feasible, and the model is able to describe how the pyrimidines are retained in RP-HPLC systems. Further design of pyrimidine-based biological entities with desirable cellular interfacial properties can make use of the insights obtained herein.
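The F-statistic argument can be made concrete with a short calculation. The sketch below assumes the standard overall-regression statistic F = (R²/p) / ((1 − R²)/(n − p − 1)); whether this is exactly the quantity reported in Table 3 is an assumption, as is the use of the same R² for the six-variate case (the text only states that the two correlations are similar).

```python
def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall regression F for n samples and p descriptors (plus intercept)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    dof = n - p - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more samples than fitted parameters")
    return (r2 / p) / ((1.0 - r2) / dof)


# R^2 = 0.8113 and n = 24 are taken from the text.
f_five = f_statistic(0.8113, n=24, p=5)
# Assumption: the six-variate model is given the same R^2 for comparison.
f_six = f_statistic(0.8113, n=24, p=6)
print(f"{f_five:.2f} {f_six:.2f}")  # prints "15.48 12.18"
```

At equal R², the model with fewer descriptors always has the larger F, so a descriptor whose removal leaves R² essentially unchanged raises F when it is dropped.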

Principal component analysis
Intuitively, log k should increase with the molecular/molar volume (MV) variable. A quick inspection of the MLR models obtained above (equations (3.1) and (3.2)) leads to an apparent contradiction. This strongly implies that, if the models are robust, a certain degree of dependence exists among the selected variables. To enhance interpretability, we further performed PCA and FA on the data. The data were arranged as a matrix with the five physicochemical variables as columns and the 24 pyrimidine compounds as rows. We further assumed that the samples constitute a random sample from a 5-variate normal distribution.
The PCA results are shown in Table 4. The first three PCs (normalized with the sample correlation matrix) collectively explain 99.20% of the total sample variance. Consequently, the sample variation is essentially captured by the first three PCs, and a reduction in the number of data variables from 5 to 3 is reasonable. Note that variable 2, with the coefficient 0.4962, receives the greatest weight in the first PC (PC1). It also has the largest correlation (absolute value 0.9612) with PC1. The weights of variables 3, 4, and 5 in PC1 (0.4263, 0.4785, and 0.4924) are almost as large as that of variable 2, indicating that these variables are about equally important to PC1. For PC2, however, variable 1 has the largest correlation (absolute value 0.7726), while the other four variables make small or negligible contributions.
It is clear that variables 2, 3, 4, and 5 (MV, Ix, Iy, and Iz, respectively) are intrinsically inter-dependent and have a primary influence on the chromatographic retention. The PCA results also indicate that variable 1 (PSA), which reflects the polarity of the molecules, makes a non-negligible contribution to the retention.
(a) of the total sample variance explained (%). Numbers in parentheses are the correlation coefficients.
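The quantities discussed here (PC weights, variable-PC correlations, and explained variance) can be computed from a correlation matrix as in the following sketch. It is an illustration under stated assumptions, not the software used in the study: the eigen-decomposition uses a plain cyclic Jacobi method written for self-containment, the variable-PC correlation is taken as weight times the square root of the eigenvalue (the usual relation for correlation-matrix PCA), all function names are hypothetical, and the demonstration table is made up.

```python
import math
import statistics


def correlation_matrix(rows):
    """Sample correlation matrix (rows = compounds, columns = descriptors)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]  # a constant column would give 0
    z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]
    p = len(cols)
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def jacobi_eigen(sym, tol=1e-20, max_sweeps=50):
    """Eigenvalues (descending) and eigenvectors of a symmetric matrix."""
    n = len(sym)
    a = [list(row) for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # right-multiply a and v by the rotation
                    for k in range(n):
                        mkp, mkq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
                for k in range(n):  # left-multiply a by the transposed rotation
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    values = [a[i][i] for i in order]
    vectors = [[v[k][i] for k in range(n)] for i in order]  # one PC per row
    return values, vectors


def pca_summary(rows):
    """Explained-variance fractions, PC weights, and variable-PC correlations."""
    values, vectors = jacobi_eigen(correlation_matrix(rows))
    total = sum(values)  # equals the number of variables
    explained = [val / total for val in values]
    correlations = [[w * math.sqrt(max(val, 0.0)) for w in vec]
                    for val, vec in zip(values, vectors)]
    return explained, vectors, correlations


# Placeholder table: 6 hypothetical compounds, 3 hypothetical descriptors.
demo = [[1.0, 2.1, 0.5], [2.0, 3.9, 0.1], [3.0, 6.2, 0.9],
        [4.0, 7.8, 0.3], [5.0, 10.1, 0.7], [6.0, 12.0, 0.2]]
explained, weights, correlations = pca_summary(demo)
print([round(e, 3) for e in explained])
```

Strongly inter-dependent columns show up as one large eigenvalue with nearly equal weights on those columns.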

Factor analysis
Before applying factor analysis (a three-factor principal component solution with varimax rotation (Johnson 1998, Johnson and Wichern 2002)) to the data set, we made the same population distribution assumptions as in the PCA above. The rotated estimated factor loadings and communalities are shown in Table 5. The first three principal components (normalized with the sample correlation matrix) explain 99.20% of the total sample variance, as in the PCA. Variables 2, 4, and 5 clearly define factor 1, F1*, with high loadings of 0.7331, 0.9429, and 0.9134, respectively, and small or negligible loadings on F2* and F3*. On the other hand, variable 3 defines F2*, while variable 1 defines F3*. As a whole, the four size- and shape-specific variables (MV and the moments of inertia) define F1* and F2* and collectively have a primary influence on the chromatographic retention of the 24 pyrimidines, while the polarity variable 1 (PSA) is subordinate.
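The varimax rotation referred to here can be sketched with Kaiser's pairwise planar rotations, as below. This is a simplified illustration and not the SAS procedure used in the study: it maximizes the raw varimax criterion without the row (Kaiser) normalization that many packages apply, the function names are hypothetical, and the unrotated loadings are made-up placeholders rather than values from Table 5.

```python
import math


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    n = len(loadings)
    total = 0.0
    for col in zip(*loadings):
        sq = [x * x for x in col]
        mean = sum(sq) / n
        total += sum((s - mean) ** 2 for s in sq) / n
    return total


def varimax(loadings, tol=1e-10, max_sweeps=100):
    """Raw varimax: rotate each pair of factors by its optimal planar angle."""
    L = [list(row) for row in loadings]
    n, m = len(L), len(L[0])
    for _ in range(max_sweeps):
        largest = 0.0
        for p in range(m - 1):
            for q in range(p + 1, m):
                u = [r[p] ** 2 - r[q] ** 2 for r in L]
                v = [2.0 * r[p] * r[q] for r in L]
                A, B = sum(u), sum(v)
                C = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                D = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                # Angle that maximizes the criterion for this pair of factors.
                phi = 0.25 * math.atan2(D - 2.0 * A * B / n,
                                        C - (A * A - B * B) / n)
                c, s = math.cos(phi), math.sin(phi)
                for r in L:
                    r[p], r[q] = c * r[p] + s * r[q], -s * r[p] + c * r[q]
                largest = max(largest, abs(phi))
        if largest < tol:  # no pair needed a noticeable rotation
            break
    return L


# Placeholder unrotated loadings: 5 hypothetical variables, 3 factors.
unrotated = [[0.60, 0.70, 0.30], [0.80, -0.30, 0.40], [0.70, -0.20, -0.60],
             [0.90, -0.10, 0.20], [0.85, -0.25, 0.30]]
rotated = varimax(unrotated)
print(round(varimax_criterion(unrotated), 4), round(varimax_criterion(rotated), 4))
```

The rotation is orthogonal, so each variable's communality (its row sum of squared loadings) is unchanged; only the distribution of the loadings across factors changes.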

Conclusion
Chemometric analysis has revealed a multiple linear relationship between the physicochemical molecular descriptors and the experimental retention parameters for the twenty-four pyrimidine compounds. The excellent predictive power of the QSRR models makes it possible to estimate the retention indices of homologous compounds whose retention values are experimentally unavailable. Subsequent PCA and FA corroborate that the four size- and shape-specific descriptors are adequate to explain most of the RP-HPLC retention behavior, while the polarity descriptor has only a secondary influence. The analyses also indicate that the four size- and shape-specific descriptors selected for this work are inter-dependent, although individually they have physical meanings highly relevant to the interpretation of the chromatographic experiments. The convoluted effects of the moments of inertia in the data suggest that the spherical symmetry/asymmetry of the molecules is essential in the chromatographic retention. As a preliminary attempt, we tried combining Ix, Iy and Iz, or simply Iy and Iz, into single variables as suggested by the results, but no improvement in the regression was observed. Further investigation in this direction toward optimal variable set selection is highly desirable.

Figure 1. The structures of the pyrimidines used in this study.

Figure 2. The plot of the experimental versus MLR-predicted RP-HPLC retention log k (equation (3.2)).

Table 4: Principal Component Analysis of the RP-HPLC retention model for the 24 pyrimidines.

Table 5. Factor Analysis of the RP-HPLC retention model for the 24 pyrimidines.
*: Rotated by the varimax procedure.