A Monte Carlo Comparison of Two Linear Dimension Reduction Matrices for Statistical Discrimination

We compare two linear dimension-reduction methods for statistical discrimination in terms of average probabilities of misclassification in reduced dimensions. Using Monte Carlo simulation, we compare the dimension-reduction methods over several different parameter configurations of multivariate normal populations and find that the two methods can yield very different results. We also apply the two dimension-reduction methods examined here to data from a study on football helmet design and neck injuries.


Introduction
Statistical classification involves assigning a given observation x to one of k possible classes (or populations) based on p measured variables, also known as features. As the dimension p of the feature space increases, the computational complexity of the classification task can become cumbersome and time consuming. In addition, more training samples are needed to design appropriate classification rules. Therefore, one often desires to reduce the dimension of the original feature vector if possible.
In this paper we consider the topic of linear feature reduction for statistical classification. Specifically, we compare and contrast the efficacies of two linear feature-reduction methods formulated by Brunzell and Eriksson (2000) and Tubbs, Coberly, and Young (1982) using a Monte Carlo simulation. The two linear dimension-reduction methods considered resemble each other but do not, in general, give equivalent results in terms of expected probabilities of misclassification. In this paper we clarify some differences and similarities between the two methods by addressing the following questions. When and why do the methods give different or similar results? For which data characteristics is one method better than the other? Can either method improve the probability of correct classification compared to using the full dimension of the feature vector?
To address these questions, we perform a Monte Carlo simulation study to compare classification performance in the full feature space versus classification in a reduced space determined via the methods developed by Tubbs, Coberly, and Young (TCY) and Brunzell and Eriksson (BE). We note that BE have contrasted their linear dimension-reduction approach to that of TCY, and to other methods such as the Mahalanobis-based linear transformation, canonical variables, principal components analysis, and four variations of Fisher's discriminant. For more comparisons of pattern recognition methods in high-dimensional settings, see Aeberhard, de Vel, and Coomans (1994).
On the data sets considered in BE, BE's dimension-reduction method is uniformly superior to that of TCY in terms of yielding smaller expected error rates in a reduced dimension. Our goal is to analyze the performance of these two linear feature-selection matrices over classification problems with diverse parameter configurations.

Linear Dimension-Reduction Matrices
In pattern recognition and statistical discriminant classification problems, one often desires to reduce the dimension of the feature space before classification. A reduced dimension can result in fewer computations, a reduction in cost and time, and even improved classification accuracy. Additionally, one typically needs fewer training observations to estimate population parameters because the necessary training sample size is directly related to the feature dimension. If the number of training observations can be reduced without a significant increase in the probability of misclassification (PMC), the classification task becomes more efficient in terms of time and cost.
Many different competing feature-reduction methods exist. The two methods we discuss are linear transformations of the feature vector x ∈ R^{p×1} of the form x → y = T^t x (2.1) with T ∈ R^{p×q}, where p is the original full dimension and q is the transformed reduced dimension. The matrix T is known as a linear feature-selection or linear dimension-reduction matrix. We desire that 1 ≤ q ≪ p and that the PMC remains essentially the same as in the full-dimension case.
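The transformation in (2.1) is simple to apply in practice. The following sketch (in Python with NumPy) shows how a single observation and a whole training set are compressed; the sizes p = 7 and q = 2 and the random T are purely illustrative, not a fitted reduction matrix.

```python
import numpy as np

# Illustrative sizes: reduce p = 7 features to q = 2 (arbitrary choices).
p, q = 7, 2
rng = np.random.default_rng(0)

T = rng.standard_normal((p, q))   # a stand-in p x q dimension-reduction matrix
x = rng.standard_normal((p, 1))   # one p-dimensional feature vector

y = T.T @ x                       # transformed feature vector y = T^t x, q x 1
assert y.shape == (q, 1)

# A whole training set X (n x p) is reduced row-wise the same way,
# since each reduced row satisfies y_i^t = x_i^t T.
X = rng.standard_normal((20, p))
Y = X @ T                         # n x q reduced training set
```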
Dimension-reduction methods are beneficial in the case when the ratio of the training sample size n to the dimension of the feature vector p is small (n/p < 4). If 1 ≤ n/p < 4, then one can encounter a problem with accurately inverting the covariance matrices due to extreme bias from small eigenvalues of the covariance matrices. Reducing the feature dimension gives a more stable estimated covariance matrix and estimated inverse covariance matrix by decreasing the number of parameters to be estimated.
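The instability described above is easy to observe numerically. The sketch below is an illustration with simulated standard-normal data (the sizes p = 10, n = 14 versus n = 200 are arbitrary choices, not the study's configurations); it compares the condition number of the sample covariance matrix in the two regimes.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10

# Small training sample: n/p = 1.4, so the sample covariance matrix has
# some very small eigenvalues and its inverse is unstable.
n_small, n_large = 14, 200
X_small = rng.standard_normal((n_small, p))
X_large = rng.standard_normal((n_large, p))

S_small = np.cov(X_small, rowvar=False)
S_large = np.cov(X_large, rowvar=False)

# The small-sample covariance is far worse conditioned, so inverting it
# amplifies estimation error.
print(np.linalg.cond(S_small))
print(np.linalg.cond(S_large))
```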
In the next two subsections, we review two linear feature-selection methods for the case of unequal covariance matrices.

Tubbs, Coberly, and Young's linear feature-selection method (TCY)
The objective of TCY is to determine a matrix to perform a linear transformation such that the PMC in the reduced q-dimensional transformed feature space is approximately the same as in the original p-dimensional feature space, or PMC(p) ≈ PMC(q). The following theorem describes the motivation for TCY.
Theorem 1 (Tubbs, Coberly, and Young, 1982): Let Π_i be a p-dimensional multivariate normal population with a priori probability α_i, mean µ_i ∈ R^{p×1}, and symmetric nonnegative-definite covariance matrix Σ_i, and let M = FG be a full-rank decomposition of M with rank(M) = rank(F) = rank(G) = q, 1 ≤ q < p. Then the p-variate Bayes procedure assuming equal misclassification costs assigns x to Π_i if and only if the q-variate Bayes procedure assuming equal misclassification costs assigns F^+ x to Π_i, i = 1, 2, ..., k, where F^+ denotes the Moore-Penrose generalized inverse of F (Harville, 1997, p. 493). Moreover, q is the smallest positive integer such that there exists a q × p compression matrix preserving the Bayes assignment of x to Π_i.
Theorem 1 yields a linear transformation F^+ ∈ R^{q×p} such that PMC(p) = PMC(q) provided rank(M) = q < p. If rank(M) = p, there exists no q × p matrix that preserves the full-feature PMC and, thus, we seek a matrix T ∈ R^{p×q} such that PMC(p) ≈ PMC(q).
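As a concrete illustration of the compression in Theorem 1, the following sketch builds an arbitrary rank-q matrix (a random stand-in for M, not a quantity estimated from data), forms a full-rank factor F, and compresses an observation by the Moore-Penrose inverse F^+.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 5, 2

# A rank-q matrix M and one full-rank decomposition M = F G.
F = rng.standard_normal((p, q))
G = rng.standard_normal((q, 8))
M = F @ G
assert np.linalg.matrix_rank(M) == q

# The Moore-Penrose inverse of F compresses x in R^p to F^+ x in R^q;
# by Theorem 1, Bayes assignment of x is preserved under this map.
F_plus = np.linalg.pinv(F)       # q x p
x = rng.standard_normal(p)
y = F_plus @ x                   # q-dimensional compressed feature
```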
The parameters µ_i and Σ_i, i = 1, 2, ..., k, are rarely known and, therefore, sample estimates must be obtained using the n_i training samples, where n_i is the sample size for estimating the parameters of the i-th population. Substituting these estimates into M yields an estimator M̂.
If n_i ≥ p, then rank(M̂) = p with probability one. In this case, Theorem 1 cannot be directly applied, so Tubbs, Coberly and Young (1982) use the singular value decomposition (SVD) to obtain a best approximation of M̂ (under the Frobenius norm) in a smaller dimension q < p.
Let M̂ = PD_p Q^t be the SVD of M̂, where D_p = Diag(λ̂_1, λ̂_2, ..., λ̂_p) with λ̂_1 ≥ λ̂_2 ≥ ... ≥ λ̂_p ≥ 0, and let F̂ = PD_p. Define D_q = Diag(λ̂_1, λ̂_2, ..., λ̂_q, 0, ..., 0), which retains only the q largest singular values. Then M̂_q = PD_q Q^t is a rank-q approximation of M̂ and F̂_q = PD_q is a rank-q approximation of F̂. A q × p feature-reduction matrix to perform the linear transformation in equation (2.1) is then F̂_q^+ = P_q^t, where P = [P_q : P_{p−q}] and P_q ∈ R^{p×q} consists of the first q columns of P. Because TCY allows for unequal means and unequal covariance matrices, F̂_q^+ should perform well when the population covariances are unequal and the number of large singular values of M is small relative to p. Also, the method should perform well when n_i is large and rank(M) = q ≪ p because the estimators x̄_i and S_i, i = 1, 2, ..., k, are strongly consistent.
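A minimal sketch of the truncation step, assuming M̂ has already been formed from the sample estimates (here a random near-rank-2 matrix stands in for M̂): the compression matrix is taken as the transpose of the q leading left singular vectors, which equals their Moore-Penrose inverse because the columns are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 7, 2

# Stand-in for M_hat: a p x 9 matrix that is rank 2 plus small noise,
# mimicking a case with only q = 2 large singular values.
M_hat = rng.standard_normal((p, q)) @ rng.standard_normal((q, 9)) \
        + 0.01 * rng.standard_normal((p, 9))

# SVD: columns of P are left singular vectors; singular values descend.
P, sv, Qt = np.linalg.svd(M_hat)
assert sv[0] >= sv[1] >= sv[2]

# Keep the q leading left singular vectors; the resulting q x p matrix
# plays the role of the TCY compression F_q^+ = P_q^t.
F_plus_q = P[:, :q].T             # q x p

x = rng.standard_normal((p, 1))
y = F_plus_q @ x                  # reduced q-dimensional feature
```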

Brunzell and Eriksson's linear feature-selection method (BE)
Tubbs, Coberly, and Young (1982) explicitly use the PMC as their dimension-reduction optimality criterion, whereas Brunzell and Eriksson implicitly consider the PMC via a derived distance measure. This distance measure is used to obtain an upper bound on the expected PMC, denoted by EPMC. For two multivariate populations with prior probabilities α_1 and α_2, the distance measure is ∆_12, a pairwise separation based on the difference of the means and the pooled covariance matrix of the two populations. In the case of two populations with equal covariance matrices and equal prior probabilities, ∆_12 is the squared generalized Mahalanobis distance between the class means µ_1 and µ_2. Brunzell and Eriksson (2000) introduce a generalized separation measure for the case of k populations with possibly unequal covariance matrices. Assuming the prior probabilities are equal, they obtain the separation measure given in equation (2.3). The objective of BE is to determine a linear dimension-reduction matrix with q ≪ p such that the full-dimension separation measure is at least approximately preserved. The following theorem provides motivation for the BE method.
Consider now the case where µ_i and Σ_i, i = 1, 2, ..., k, are unknown and must be estimated from the n_i training samples. Substituting the sample estimates into U yields an estimator Û. If rank(Û) = r > q, then Û does not directly yield a q × p linear dimension-reduction matrix. Therefore BE, like TCY, utilizes the SVD rank-q approximation of Û to obtain a linear feature-reduction matrix that compresses p-dimensional observation vectors into a q-dimensional transformed feature space, where 1 ≤ q < p.
Let Û = RD_p S^t be the SVD of Û, where D_p contains the singular values in descending order. Replacing all but the q largest singular values with zero gives a rank-q approximation Û_q of Û and a rank-q approximation Ĥ_q of Ĥ. A q × p feature-reduction matrix to perform the linear transformation in equation (2.1) is then Ĥ_q^t = R_q^t, where R_q ∈ R^{p×q} consists of the first q columns of R. The BE technique classifies the data based essentially on the rotated differences of the means rather than on the differences in the covariance structures. Note that the separation measure (2.3) uses a type of pooled covariance matrix. By pooling the pairs of covariance matrices, Brunzell and Eriksson are not necessarily using all of the information in the differences of the covariance matrices. However, pooling helps with the near-singularity of S_i that occurs when n_i is small relative to p: pooling S_i and S_j gives more stable values of (S_i + S_j)^{-1} as an estimator of (Σ_i + Σ_j)^{-1}, 1 ≤ i < j ≤ k. Therefore, BE should perform well relative to TCY when the covariance matrices are similar, when essentially all of the discrimination information is in the means, and when n_i/p is small. However, BE may lose classificatory information by pooling acutely dissimilar pairs of covariance matrices.
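The following sketch illustrates the pairwise-pooling idea on simulated data. The column construction (S_i + S_j)^{-1}(x̄_i − x̄_j) is an assumption consistent with the pooling described above, not necessarily the exact definition of Û in equation (2.4); the class count, dimension, and data are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
k, p, q = 3, 6, 2
n = 30

# Simulated training data: k classes with shifted means.
data = [rng.standard_normal((n, p)) + i for i in range(k)]
means = [X.mean(axis=0) for X in data]
covs = [np.cov(X, rowvar=False) for X in data]

# One column per pair (i, j): the mean difference rotated by the inverse
# of the pairwise-pooled covariance matrix (an assumed form of U-hat).
cols = [np.linalg.solve(covs[i] + covs[j], means[i] - means[j])
        for i, j in combinations(range(k), 2)]
U_hat = np.column_stack(cols)            # p x k(k-1)/2 columns

# BE-style compression: the q leading left singular vectors of U-hat.
R, sv, St = np.linalg.svd(U_hat)
H_q_t = R[:, :q].T                       # q x p reduction matrix
```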
We note that BE is limited to a feature-reduction dimension that depends on the number of classes k. For k = 2, BE allows one to reduce the data to only one dimension regardless of the full feature-vector dimension: Ĥ reduces observations to a one-dimensional feature space and, therefore, one may lose discriminatory information. In general, with k classes, BE can reduce the feature vector to at most q = k(k − 1)/2 dimensions because U, given in (2.4), has k(k − 1)/2 columns. This restriction is potentially a major disadvantage, especially in the case when k/p is small. A larger reduced dimension may be more beneficial in preserving or improving the full-dimension error rate. On the other hand, TCY allows one to reduce the original feature vector to any dimension q, 1 ≤ q < p, for any number of populations k.

A Simulation Study
We conducted a Monte Carlo simulation to compare the performance of TCY and BE using six different population configurations. We generated 1000 training and test sets from each multivariate normal distribution for each parametric configuration. We obtained estimates of the configuration parameters using the training data, and the test data were classified using the quadratic discriminant function (QDF). We computed F̂_q^+ and Ĥ_q^t, and found the estimated expected error rates by averaging the estimated conditional error rate over all training samples. We compared TCY and BE in terms of their estimated EPMC and contrasted this with the estimated EPMC for the full feature dimension. Also, we used n_i = 2p and n_i = 10p to determine the effect of training-sample size on the two methods.

For each configuration we calculated the ranks of M and U, along with SV(M) and SV(U), where SV(A) represents the set of singular values of a matrix A. For TCY, the number and values of the non-zero elements of SV(M) indicate the appropriate reduced dimension q for which little classificatory information is lost. To predict the performance of BE, we calculated rank(U) and the average generalized Mahalanobis distance (AGMD) among the means. A relatively large value of AGMD (AGMD > 3) indicates that most of the classificatory information lies in the means, and thus BE is more likely to perform well. In Table 1 we summarize the values of these descriptive measures for each parameter configuration.
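The simulation loop can be sketched as follows. This is not the authors' code: the configuration (two populations, p = 4, 50 replications) and the hand-rolled quadratic discriminant scores are illustrative stand-ins for the study design described above.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_train, n_test, reps = 4, 40, 100, 50

def qdf_error(mu_list, cov_list):
    """One Monte Carlo replication: fit a QDF, return the test error rate."""
    k = len(mu_list)
    train = [rng.multivariate_normal(m, C, n_train) for m, C in zip(mu_list, cov_list)]
    mhat = [X.mean(axis=0) for X in train]
    Chat = [np.cov(X, rowvar=False) for X in train]
    test = [rng.multivariate_normal(m, C, n_test) for m, C in zip(mu_list, cov_list)]
    errors = 0
    for true_cls, X in enumerate(test):
        # Quadratic discriminant scores under equal priors:
        # -0.5 log|C_i| - 0.5 (x - m_i)^t C_i^{-1} (x - m_i)
        scores = np.stack([
            -0.5 * np.log(np.linalg.det(C))
            - 0.5 * np.einsum('ij,jk,ik->i', X - m, np.linalg.inv(C), X - m)
            for m, C in zip(mhat, Chat)])
        errors += np.sum(scores.argmax(axis=0) != true_cls)
    return errors / (k * n_test)

# Illustrative configuration: means separated in one coordinate,
# proportional spherical covariance matrices.
mus = [np.zeros(p), np.r_[2.0, np.zeros(p - 1)]]
covs = [np.eye(p), 2 * np.eye(p)]

# Estimated EPMC: average conditional error rate over training samples.
epmc = np.mean([qdf_error(mus, covs) for _ in range(reps)])
print(epmc)
```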
In the following sections, we discuss the simulation results for the six configurations considered.
The results are given in Table 2 and show that TCY outperforms BE for this configuration, but BE does surprisingly well.
The average PMC is actually reduced by both dimension-reduction methods when n = 14. The main reason for this phenomenon is that when n is small relative to p, not enough data are available to estimate the p(p + 3)/2 parameters of each population. By reducing the full feature dimension p to the reduced dimension q, we increase the ratio of the training-sample size to the number of parameters to be estimated and thus obtain improved estimates for the reduced set of parameters.
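The parameter-count arithmetic is worth making explicit. Each p-variate normal class has p mean components plus p(p + 1)/2 distinct covariance entries, for p(p + 3)/2 parameters in total; with p = 7 and n = 14 (the n_i = 2p case) the ratio of sample size to parameters is only 0.4, while at q = 1 it rises to 7.

```python
def params_per_class(p: int) -> int:
    # p mean components plus p(p+1)/2 distinct covariance entries.
    return p * (p + 3) // 2

n = 14
for dim in (7, 1):  # full dimension p = 7 vs. reduced dimension q = 1
    print(dim, params_per_class(dim), n / params_per_class(dim))
# p = 7: 35 parameters, ratio 0.4; q = 1: 2 parameters, ratio 7.0
```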

Means closer together, covariance matrices as in 3.1
In this setting the two populations have the same covariance matrices as the configuration in 3.1, but the means are now closer together. The population mean parameters are µ_1 = 0 and µ_2 = [2, 1, 0, 0, 0, 0, 0]^t. Again, almost all classificatory information can be captured with q = 1 because of the relative sizes of the elements of SV(M). A relatively small AGMD indicates that a majority of the classificatory information is in the covariance matrices. Thus, BE should perform relatively poorly because it considers only the discriminatory information in the means. The results for this configuration are shown in Table 3.
As expected, BE does not perform as well as TCY in the reduced dimension q = 1. For the small training-sample size, the classification results for TCY are slightly better than the full-dimension results.
The value AGMD = 1.52 suggests that most of the classificatory information is in the covariance matrices, which should diminish the performance of BE. Note that the covariance matrices are relatively different: one is spherical while the other two are elliptical. Thus, the pooled estimates of the covariance matrices (S_i + S_j) used in BE differ significantly from the individual covariance matrices S_i, i = 1, 2, ..., k. The simulation results are presented in Table 4. As expected, TCY performs better than BE regardless of the training-sample size and the reduced-dimension size q. This result is mainly due to the fact that TCY uses classificatory information in the covariance matrices that is unused by BE. Neither dimension-reduction method performs as well as the full feature dimension. However, both methods perform at least as well at q = 2 dimensions as at q = 3 dimensions. The reason for this phenomenon is that using more dimensions than necessary adds "noise", or additional variability, to the dimension-reduction approximation. This additional noise yields a linear feature-selection matrix for q = 3 that is worse than the linear feature-selection matrix for q = 2 in terms of the average PMC. Also, notice that the performance of BE is fairly constant over all the reduced dimensions. The reason is that the population means are aligned in one dimension and, therefore, BE gains no information from additional dimensions.
All six elements of SV(M) are non-zero and relatively large, indicating that some classificatory information may be lost with TCY when q ≤ 5. Note that all of the covariance matrices are spherical, so BE should not lose information from pooling because the covariance matrices span the same space. Also, note that AGMD = 4.52, which indicates that most of the discriminatory information is in the means and that the BE method should be superior. The results for this configuration are shown in Table 5. For this configuration BE performs somewhat better than TCY. One reason is that BE gains from pooling the individual covariance matrices because they are proportional. Thus, all of the classification information is in the differences of the means. Therefore, TCY is actually adding noise, or variability, to the reduced-dimension representation by including the differences in the covariance matrices.
The first components of the feature vector are the most informative. Because the common covariance matrix is highly ellipsoidal, the estimated group means differ in a low-variance subspace but vary in a high-variance subspace. Note that the elements of SV(M) indicate that q = 2 should be the optimal reduced dimension for TCY. The BE method should perform well because the population covariance matrices are equal and, thus, pooling the sample covariance matrices is beneficial. Also, all of the discriminatory information is in the differences of the means, as summarized by the fact that AGMD = 10. The results for this configuration are shown in Table 6.
Here, BE is far superior to TCY, as expected. For TCY we see that the estimated average PMC is very high for q ≤ 3 due to the variability of the differences of the sample means. That is, the vector space spanned by [x̄_2 − x̄_1, x̄_3 − x̄_1] can vary greatly. Therefore, a conditional reduced-feature space can be drastically different from the optimal reduced-feature space.

Means differ in the last p − 1 features, equal elliptical covariance matrices
This example was also studied by Friedman (1989). We modeled the three populations using the same elliptical covariance matrix as in Section 3.5. However, in this configuration the means differ in a high-variance subspace. The parametric configuration is µ_1 = 0 and µ_{2i} = 2.5(i − 1)√(e_i)/(0.5p√(p − 1)), with µ_3 specified analogously. All classificatory information can be captured with two dimensions. Again, we expect BE to perform better than TCY because the population covariance matrices are equal and all of the classificatory information is in the means.
For this configuration BE and TCY perform similarly (Table 7). The TCY method performs considerably better than in configuration 3.5 because the sample means now differ in a high-variance subspace and vary in a low-variance subspace. For this configuration BE benefits from the pair-wise pooling of the covariance matrices, while TCY benefits from increased stability in the sample means.

A Parametric Bootstrap Simulation
In the following simulation, we use a real data set to estimate the population means and covariance matrices. We perform our Monte Carlo simulation with a parametric bootstrap using three populations: N(x̄_1, S_1), N(x̄_2, S_2), and N(x̄_3, S_3).
The data set considered is from a preliminary study by G. R. Bryce and R. M. Barker at Brigham Young University (Rencher, 1995) on a possible link between football helmet design and neck injuries. Six different head measurements were taken on each individual, and the study included three classes with thirty subjects in each class: Π_1, high-school football players; Π_2, college football players; and Π_3, non-football players.
The results of the parametric bootstrap simulation are presented in Table 8. The two methods perform similarly for this configuration. As predicted, TCY performs better when q = 2, and BE performs surprisingly well considering the moderate value of AGMD = 2.90. However, neither method improves the misclassification error when compared to the full dimension: the gain in the ratio of training-sample size to parameter dimension is offset by a loss of information in the reduced feature space. We also note that at q = 3 neither linear dimension-reduction technique yields an EPMC close to the full-feature EPMC. Moreover, in view of the moderate value of AGMD, the BE linear feature-selection method is, somewhat surprisingly, roughly equivalent to the TCY linear feature-selection method in terms of the reduced-space average PMCs.

Concluding Remarks
We first remark that BE benefits from pooling the pairs of covariance matrices when they are similar. The performance of BE is enhanced if most of the classificatory information is contained in the means. This is achieved through the rotation of the pairwise mean differences by (S_i + S_j)^{-1} into a feature space that preserves or nearly preserves the AGMD.

Table 1 :
Description of simulation configurations

Table 6 :
Means differ in the first p − 1 features with equal elliptical covariance matrices

Table 7 :
Means differ in last p − 1 features and equal elliptical covariance matrices (k = 3)