Regression: Comparing Predictors and Groups of Predictors Based on a Robust Measure of Association

Abstract: Let ρj be Pearson’s correlation between Y and Xj (j = 1, 2). A problem that has received considerable attention is testing H0: ρ1 = ρ2. A well-known concern, however, is that Pearson’s correlation is not robust (e.g., Wilcox, 2005), and the usual estimate of ρj , rj has a finite sample breakdown point of only 1/n. The goal in this paper is to consider extensions to situations where Pearson’s correlation is replaced by a particular robust measure of association. Included are results where there are p > 2 predictors and the goal to compare any two subsets of m < p predictors.


Introduction
Let Y i , X i1 , . . ., X ip (i = 1, . . .n) be a random sample from some unknown p + 1-variate distribution, let ρ yj be Pearson's correlation between Y and X j , and let ρ jk be the correlation between X j and X k (j, k = 1, . . ., p; j = k).A problem that has received considerable attention is testing (1.1) (e.g., Hittner, May and Silver, 2003;Hotelling, 1940;Olkin, 1967;Dunn and Clark, 1969;Meng, Rosenthal and Rubin, 1992); Steiger, 1980;Williams, 1959;Wilcox and Tian, 2008;Hendrickson, Stanley and Hills, 1970).Most methods are now known to be unsatisfactory in terms of Type I errors (e.g., Hittner et al. , 2003;Wilcox and Tian, 2008).Indeed, Wilcox and Tian (2008) found only one method that performs well in simulations.The goal in this paper is to extend methods for testing (1.1) in two ways.The first is to compare any two predictors based on a particular robust measure of association with Y .The second is to consider a generalization of the first method to situations where the strength of the association between m variables and Y is of interest, 1 < m < p.To be slightly more precise, consider two non-identical subsets of the integers 1, . . ., p, both having cardinality m.Denote these subsets by M 1 and M 2 , let X M j represent the predictor variables indexed by M j (j = 1, 2), and let τ yM j be some measure of the strength of the association between Y and X M j .The goal is to test (1.2) Several methods were considered for the case m = 1, only one of which performed reasonably well in simulations, in terms of Type I errors and power.The method extends in an obvious way to m > 1, but it does not control the probability of a Type I error reasonably well in simulations.The one method that did perform well in simulations, when m > 1, is based in part on Tukey's notion of halfspace depth.It can be used when m = 1, but in terms of power it is not recommended for this special case for reasons that will be made evident.
The suggested method when m = 1 is based on a particular robust covariance.As is well known, there are many robust analogs of the usual covariance matrix that result in robust analogs of ρ, which include the minimum volume ellipsoid (MVE) estimator (Rousseeuw and Leroy, 1987), the (fast) minimum covariance determinant (MCD) estimator (e.g., Rousseeuw and van Driesse, 1999), an estimate based on the median ball algorithm proposed by Olive (2004), the OGK estimator suggested by Maronna and Zamar (2002), the TBS (translated biweight) estimator derived by Rocke (1996), and various skipped estimators (e.g., Wilcox, 2005).Here the focus is on Olive's estimator, primarily because it is relatively simple and easy to compute.It is not being suggested that other robust measures of scatter should be eliminated from consideration.However, some results are briefly noted using the MVE and the (fast) MCD estimators to illustrate that the choice of estimator has practical consequences regarding the approximation of an appropriate critical value.

Some Background
This section reviews some methods and results needed in this paper.

The median ball algorithm
For completeness, the computational details associated with Olive's estimator, which is based on what he calls the reweighted median ball algorithm (RMBA), are outlined here.His approach begins with two estimates of location and scatter, (T 0,j , C 0,j ) (j = 1, 2).Compute all n Mahalanobis distances D i (T 0,j , C 0,j ), i = 1, . . ., n, based on the jth measure of location and scatter.The next iteration consists of estimating the usual mean and covariance matrix based on the c n ≈ n/2 cases corresponding to the smallest distances, yielding (T 1,j , C 1,j ).Repeating this process, based on D i (T 1,j , C 1,j ), yields an updated measure of location and scatter, (T 2,j , C 2,j ).As done by Olive, (T 5,j , C 5,j ) is used here.The first of the two starting values used by Olive takes (T 0,1 , C 0,1 ) to be the usual mean and covariance matrix.The other starting value, (T 0,2 , C 0,2 ), is the usual mean and covariance based on the c n cases that are closest to the coordinate wise median in Euclidean distance.Let (T A , C A ) = (T 5,j , C 5,j ), where j = 1 if the determinant |C 5,1 | ≤ |C 5,2 |, otherwise j = 2.The RMBA estimate of location is T A and the measure of scatter is where MED indicates the sample median and χ 2 p,0.5 is the .5 quantile of a chisquared distribution with p degrees of freedom.Olive and Hawkins (2006) show that the RMBA estimate is , where µ j and σ j are the mean and standard deviation of X ij .In the event the population mean and variance are not known, their usual estimates are used instead.Momentarily, focus on p = 2.The (sample) generalized variance associated with Y i and Z ij reflects how tightly clustered together the n pairs of points happen to be.From basic principles, the population analog is the determinant of the usual covariance matrix, namely where σ 2 y is the variance of Y .And of course, if Y is standardized as well, the generalized variance reduces to 1 − ρ 2 yj .So if we take τ yj to be the generalized variance between Y and Z j , testing (1.2) reduces to testing (1.1).However, a well-known practical concern is that the usual generalized variance is not robust, roughly meaning that a single outlier can give a distorted sense of the strength of the association among the bulk of the observations.
Standardizing Y , in addition to standardizing X, is perhaps the seemingly more natural way of measuring the strength of an association.But this is not done here because when both are standardized, it becomes more difficult to control the probability of a Type I error when comparing the strength of the association corresponding to any two predictors.In particular, a method was found that performs well in simulations when Y is not standardized, but when Y is standardized, the method performs poorly in simulations and alternative methods that were considered also fail to perform well.Note, however, that even though Y is not standardized, the relative strength of the association corresponding to two subsets of variables can measured with τ y1 /τ y2 .
Again consider two subsets of predictors indexed by M j (j = 1, 2), both sets having cardinality m.One way of measuring the strength of the association between Y and the Z M j variables is to use some robust analog of the generalized variance and here, when m = 1, the strategy is to simply replace the usual covariance matrix with the RMBA estimate of scatter.Now the estimate of τ yj , τyj , is taken to be determinant of this matrix and testing (1.2) provides a way of comparing the strength of association between these two sets of variables.Of course, another possible way of measuring the strength of association between Y and two or more predictors is to fit some robust regression line and use some obvious analog of the usual squared multiple correlation coefficient.Perhaps there is some practical advantage associated with this alternative approach, but this issue is not pursued here.

Methods Considered for Testing (1.2) When m = 1
Momentarily, attention is focused on the case p = 2, m = 1.The first method for testing (1.2) that was considered here was a basic percentile bootstrap method (e.g., Efron and Tibshirani, 1993;Liu and Singh, 1997).It performs relatively well for a variety of situations where the goal is to compare robust measures of location or scatter (e.g., Wilcox, 2005).However, for the problem at hand, this approach was found to be unsatisfactory.A basic concern is that the actual level of the test is too sensitive to the distribution used to generate data in simulations.So in particular, if some adjustment is made so that, under normality, this approach has a Type I error probability equal to some specified value, the actual level of the test when generating data from a non-normal distribution can exceed the nominal level by a considerable amount.Consequently, this approach was abandoned.
The second strategy considered was to use an obvious test statistic based in part on a bootstrap estimate of the standard error of τyj .
this provided more stability in simulations (estimated Type I errors are less affected by the distribution used to generate the data) and so the test statistic used here is There remains the problem of approximating the null distribution of T .Still focusing on the case p = 2, m = 1, consider the situation where sampling is from a trivariate normal distribution with correlations ρ y1 = ρ y2 = ρ 12 = 0.The initial strategy was to use simulations to determine an appropriate critical value for this special case and then use this critical value under non-normality.But this was not quite satisfactory.When sampling from a skewed distribution with all of the correlations set equal to .5, the actual Type I error probability, when testing at the .05level, exceeded .075 in some cases.
To elaborate, let Z be a standard normal distribution, in which case has a g-and-h distribution where g and h are parameters that determine the first four moments.When g = h = 0, X has a standard normal distribution.
With g = 0 this distribution is symmetric and it becomes increasingly skewed as g gets large.As h gets large, the g-and-h distribution becomes more heavytailed.Table 1 shows the skewness (κ 1 ) and kurtosis (κ 2 ) for each distribution considered in the simulations used here.They correspond to a standard normal (g = h = 0), a symmetric heavy-tailed distribution (h = 0.2, g = 0.0), an asymmetric distribution with relatively light tails (h = 0.0, g = 0.2), and an asymmetric distribution with heavy tails (g = h = 0.2).(Additional properties of the g-and-h distribution are summarized by Hoaglin, 1985.)It was found that altering h, with g fixed, has little or no effect on the probability of a Type I error.But altering g (increasing skewness) did have some effect.Also, increasing the correlation increased somewhat the probability of a Type I error, but as ρ approaches 1, the reverse was found to be true.Based on these results, it was decided to determine critical values when sampling from a g-and-h distribution with h = 0.0, g = 0.2 and a common correlation ρ = .5.In particular, .05critical values were estimated using the method just described for n = 30, 50, 100, 200 and 400 and it was found that the .05critical value is given approximately by 2.06 − 5.596 √ n .
Note that if asymptotic normality is assumed, the intercept 2.06 would be 1.96.When using this critical value, situations are avoided where in the simulations to be described, the estimated Type I error probability exceeds .075when the nominal level is .05.Simultaneously, for the normal and other distributions considered, the estimated probability generally does not drop below .025.One exception is when n = 30, h = 0.0, g = 0.2 and the common correlation is ρ = .8, in which case the estimate was .024.This will be called method R1.

Methods Considered for Testing (1.2), m > 1
Now consider the case p > 2 with m = 2.The strategy for determining a critical value, just outlined, extends in an obvious way to the situation at hand.But now simulations indicate that the actual Type I error probability is much more sensitive to the distribution used to generate data.If adjustments are made to avoid Type I errors that are well above the nominal level, there are situations (including normality) where the method is too conservative, meaning that when testing at the .05level, the actual level is estimated to be less than .01.Moreover, power can be poor relative to the method suggested here.
Currently, only one method has been found that controls Type I error probabilities reasonably well, which is based in part on Tukey's notion of halfspace depth.Roughly, the idea is that if Y and X M 1 have a stronger association than Y and X M 2 , this will be characterized by the points (Y, Z M 1 ) being more tightly clustered together than the points (Y, Z M 2 ).And the extent to which this is true can be investigated with a slight extension of results in Liu and Singh (1993), which is described in Wilcox (2005, p. 256).
Tukey's halfspace depth represents an extension of the notion of ranks to multivariate data.It reflects how deeply a point is nested within a multivariate data cloud without reliance on any type of covariance matrix.To describe its formal definition, consider any point x where x is a column vector having length p, let H be any closed halfspace containing the point x, and let P (H) be the probability associated with H. Then Tukey's halfspace depth is There are relatively simple and reasonably accurate methods for approximating the halfspace depth of point and here the method given by Wilcox (2005, p. 207) will be used.
Momentarily consider the case of two independent, multivariate distributions, F and G. Let U 1 , . . ., U n 1 and V 1 , . . ., V n 2 be random samples from F and G, respectively, and let That is, R(v; F ) is the probability that the depth of a randomly sampled U, relative to F , is less than or equal to the depth of some particular point, v, again relative to F .Said another way, R(v; F ) is the fraction of the F population that is less central than the value v.For V i , R(V i ) is estimated with the proportion of points among U 1 , . . ., U n 1 that have a smaller halfspace depth than V i , relative to F .The quality index proposed by Liu and Singh (1993) is the average of all R(v; F ) values with respect to the distribution G. Put another way, for a randomly sampled U and V, is the probability that the depth of V is greater than or equal to depth of U .Liu and Singh show that the range of Q is [0, 1] and when where F n 1 is the usual empirical distribution associated with the first group, and an estimate of To determine whether the distributions differ in scatter, one can test and the basic percentile bootstrap method appears to perform well in terms of Type I errors (Wilcox, 2005, p. 256).That is, randomly sample with replacement n j vectors of observations from the jth group, estimate Q(F, G) and Q(G, F ), yielding say Q * (F, G) and Q * (G, F ), and let Repeat this process B times and put the results in ascending order yielding D , where = αB/2, rounded to the nearest integer, and u = B − .Returning to the problem at hand, simple modifications of the method just outlined were considered, with all but one performing poorly in simulations.For example, taking bootstrap samples of all n vectors and computing D * was considered.Another variation was to use independent bootstrap samples when computing Q * (F, G) and Q * (G, F ). Another strategy is to split the data into two independent components and use the method outlined in the previous paragraph.Certainly, this approach is not ideal, but in terms of controlling the Type I error probability, it is only approach found that performs reasonably well in simulations.
More formally, let n 1 = n/2, rounded down to the nearest integer.Then, under random sampling, and the issue is whether the (population) amount of scatter corresponding to these two groups differs.And here the hypothesis that they do not differ is tested with the method just outlined.This will be called method R2.
A variation of this method also controls Type I errors in the simulations considered here.First test the hypothesis using method R2 as just described, and then test it again, only now using (Z i,M 2 , Y i ) (i = 1, . . ., n 1 ) and (Z i,M 1 , Y i ) (i = n 1 + 1, . . ., n).Denote the corresponding p-values with p 1 and p 2 .Then apply a sequentially rejective method designed to control the probability of at least one Type I error (e.g., Rom, 1990).That is, reject at the α level if both p 1 and p 2 are less than or equal to α, or if either one is less than or equal to α/2.However, this approach was found to have less power than method R2, so further details are omitted.

Two Illustrations
Data for X were generated from a trivariate normal distribution with ρ 12 = .5,and Y was taken to be Y = X 1 + .5X 2 + .1X 3 + with standard normal.
Here the sample size is n = 200.Figure 1 shows a scatterplot of Y versus Z 1 and Z 3 .The points marked by * correspond to the first predictor and points marked by o are for the third predictor.Notice how the first set of points tend to be nested within the points marked by an o, indicating that the first predictor has a stronger association with Y .And the extent to which this is true can be quantified with halfspace as previously described.The polygons denote the central half of the data as measured by halfspace depth.The .95 confidence interval for Q(F, G) − Q(G, F ) is (0.15, 0.46).If instead the goal is to compare the strength of association based of the first two predictors versus predictors 2 and 3, the .95confidence interval is (0.176, 0.486), indicating that the first two predictors, taken together, have a stronger association with Y .The following illustrates that the choice of method can make a practical difference when analyzing actual data.In an unpublished study by L. Doi, of interest were predictors of reading ability among children.One portion of the study dealt with how well measures of accuracy and speed, when identifying lower case letters, predicts scores on a reading comprehension test.Figure 2 shows a scatterplot of the points after the predictors have been standardized.As is evident, the points indicated by an *, which correspond to a measure of accuracy, are more tightly clustered together except for a few points in the lower right portion of the plot.If Pearson correlations are compared using the method in Zou (2007), no difference is found at the .05level.The sample size is 81 and the estimates of Pearson's correlation for accuracy and speed are −.11 and −.33, respectively.Note that Figure 2 suggests that, there is a seemingly strong positive association.Pearson's correlation is negative due to the outliers in the lower right corner.The correlation based on the RMBA covariance matrix is .59.

Simulation Results
Simulations were used to check the ability of methods R1 and R2 to control the probability of a Type I error when the marginal distributions have one of the four g-and-h distributions previously described.The predictors were taken to have a common correlation of ρ = 0 or .5.Correlations among the predictors were formed as follows.Let R be the desired correlation matrix and form the Cholesky decomposition U U = R, where U is the transpose of U. If X represents an n × p matrix of data where the marginal distributions are independent, then XU produces an n × p matrix of data that has population correlation matrix R.
Table 2 reports the estimated Type I error probabilities for methods R1 and R2 when n = 40.The results for method R2 are for the case p = 3, M 1 = {1, 2} and M 2 = {2, 3}.That is, the goal is to test the hypothesis that the strength of association based on predictors 1 and 2 is the same as for predictors 2 and 3.As is evident, Type I error probabilities are controlled reasonably well among all of the situations considered.

Concluding Remarks
There are numerous variations of method R2 and perhaps some of them have practical value.For example, there are a variety of ways of measuring the depth of a point in a multivariate data cloud other than the approach used here (e.g., Wilcox,chapter 6).The main point is that many methods have been ruled out and two methods were found that perform reasonably well in terms of Type I errors.Method R1 is best, in terms of power, when m = 1, but for m > 1, only method R2 was found to be reasonably satisfactory in simulations.
Regarding method R1, of interest is whether some alternative robust covariance can be used with the same .05critical value used here.The simulations were run again using the MVE and MCD estimators and it was found that now Type I error probabilities are not controlled well.Perhaps other robust covariances give satisfactory results, but this remains to be determined.
It is noted that a few simulations were run using method R2 with p = 4 and m = 2 and 3.The results were very similar to those in Table 2. Of course, perhaps method R2 breaks down with p sufficiently large, but this possibility is left for future studies, Finally, both R and S-PLUS software are available from the author for applying both methods R1 and R2; ask for the functions regpord and regpordv2.

Figure 1 :
Figure 1: Scatterplot corresponding to two predictors
Let τ * yj be the estimate of τ yj based on this bootstrap sample.Repeat this process B times yielding the B bootstrap estimates τ * yjb , b = 1, . . ., B. Then an estimate of the squared standard error of τy1 To elaborate, let Y * i , X * i1 , . . ., X * ip be a bootstrap sample obtained by randomly sampling with replacement n vectors of observations from Y i , X i1 , . . ., X ip , (i = 1, . . .n).
This performed considerably better in simulations, but the estimated probability of a Type I error was still judged to be a bit too unstable as a function of the distributions generating the data.Note that in the bootstrap world, τ * yj =

Table 1 :
Some properties of the g-and-h distribution