A Bayesian Approach to the Multiple Comparisons Problem

Consider the problem of selecting independent samples from several populations for the purpose of between-group comparisons. An important aspect of the solution is the determination of clusters where mean levels are equal, often accomplished using multiple comparisons testing. We formulate the hypothesis testing problem of determining equal-mean clusters as a model selection problem. Information from all competing models is combined through Bayesian methods in an effort to provide a more realistic accounting of uncertainty. An example illustrates how the Bayesian approach leads to a logically sound presentation of multiple comparison results.


Introduction
Consider the problem of selecting independent samples from several populations for the purpose of between-group comparisons, either through hypothesis testing or estimation of mean differences. A companion problem is the estimation of within-group mean levels. Together, these problems form the foundation for the very common analysis of variance framework, but they also describe essential aspects of stratified sampling, cluster analysis, empirical Bayes, and other settings.
Procedures for making between-group comparisons are known as multiple comparisons methods. The goal of determining which groups have equal means requires testing a collection of related hypotheses. We examine this hypothesis testing problem from a Bayesian viewpoint. In Section 2, we detail how the determination of equal-mean clusters can be formulated as a Bayesian model selection problem. Posterior model probabilities are computed via the Bayesian information criterion. Bayesian model averaging is introduced as a tool for combining information from all competing models in an effort to provide a more realistic accounting of uncertainty. An example in Section 3 illustrates how the Bayesian approach to multiple comparisons testing leads to a logically sound interpretation of the results.
Section 4 addresses Bayesian estimation of within-group mean levels for the multiple comparisons problem. Similar to Stein estimation, a type of shrinkage is performed. The approach is more general, however, because shrinkage is not necessarily toward an overall mean, but rather toward means deemed likely to be equal.

The Multiple Comparisons Problem
Consider independent samples from $I$ normally distributed populations with equal variances:
$$X_{11}, X_{12}, \ldots, X_{1n_1} \sim \text{iid } N(\mu_1, \sigma^2); \quad \ldots; \quad X_{I1}, X_{I2}, \ldots, X_{In_I} \sim \text{iid } N(\mu_I, \sigma^2). \quad (2.1)$$
The goal of the multiple comparisons problem is to determine where within-group means are equal in order to create clusters of groups with equal mean levels. Thus, one is testing $H_{(a,b)}: \mu_a = \mu_b$ for each pair $(a,b)$, a total of $I(I-1)/2$ distinct but related hypotheses. A typical frequentist test will decide in favor of $H_{(a,b)}$ when
$$|\bar{x}_a - \bar{x}_b| \le Q_{(a,b)},$$
where the definition of $Q_{(a,b)}$ depends upon the approach. A point of difficulty common to classical multiple comparison testing procedures of this form is illustrated by the case where one decides in favor of $\mu_1 = \mu_2$ and in favor of $\mu_2 = \mu_3$, but against $\mu_1 = \mu_3$. Such cases are difficult to interpret since a single choice of a clustering is not obtained. Employing a Bayesian philosophy, one might instead state the goal as quantifying the evidence in favor of $H_{(a,b)}: \mu_a = \mu_b$ for each $(a,b)$. The existence of equal mean levels is considered physically plausible for the multiple comparisons problem, so Bayesian testing of these precise hypotheses requires a measure of prior/posterior belief in each $H_{(a,b)}$, and a measure of prior/posterior belief in the effect size $\delta_{(a,b)} = \mu_b - \mu_a$ when $H_{(a,b)}$ is not true. Thus, the distribution over $(\mu_1, \ldots, \mu_I)$ must consist of two components to reflect the possibility of equal means. A measure of belief in the precise nulls $\{H_{(a,b)}\}$ will be represented by point mass probabilities, while a continuous portion of the distribution will reflect belief in the size of the differences between means when $H_{(a,b)}$ is not true.
The determination of prior probabilities over the hypotheses $\{H_{(a,b)}\}$ is complicated by the fact that the collection does not consist of mutually exclusive events. For example, $H_{(1,2)}$ true ($\mu_1 = \mu_2$) may occur with $H_{(2,3)}$ true ($\mu_2 = \mu_3$) or with $H_{(2,3)}$ false ($\mu_2 \ne \mu_3$). One cannot develop a prior by comparing relative beliefs in each of the hypotheses individually.
It is clear that the hypotheses must be taken as a whole when assigning prior belief. Allowable decisions can be reached through the formation of equal-mean clusters among the $I$ populations. For example, the clustering $\mu_1 = \mu_2$, $\mu_3 = \mu_4$ implies $H_{(1,2)}$ true, $H_{(3,4)}$ true, and all others false. Designating a clustering of equal mean levels defines a model nested within (2.1). When two or more means are taken as equal, we merely combine all relevant samples into one. The smaller model is of the same form as (2.1), only for $I' < I$ groups. The problem can now be stated in terms of Bayesian model selection, where each allowable combination of hypotheses corresponds to a candidate model.
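To make the model space concrete, here is a minimal sketch (in Python; the helper name `set_partitions` is ours, not the paper's) that enumerates every allowable clustering of $I = 5$ group means as a set partition. Each partition corresponds to one candidate model, and the total count is the Bell number $B(5) = 52$.

```python
def set_partitions(items):
    """Recursively generate every partition of `items` into nonempty
    clusters; each partition corresponds to one candidate model."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or let `first` form a cluster of its own.
        yield [[first]] + smaller

models = list(set_partitions([1, 2, 3, 4, 5]))
print(len(models))  # 52 = Bell number B(5): the size of the model space
```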
We provide a short review of Bayesian model selection in the general setting, using the notation of Neath and Cavanaugh (1997). Let $Y_n$ denote the observed data. Assume that $Y_n$ is to be described using a model $M_k$ selected from a set of candidate models $\{M_1, \ldots, M_L\}$. Assume that each $M_k$ is uniquely parameterized by $\theta_k$, an element of the parameter space $\Theta(k)$. In the multiple comparisons problem, the class of candidate models consists of all possible mean-level clusterings. Each candidate model is parameterized by the mean vector $\mu = (\mu_1, \ldots, \mu_I)$ and the common variance $\sigma^2$, with the individual means restricted by the model-defined clustering of equalities. That is, each model determines a corresponding parameter space where particular means are taken as equal.
Let $L(\theta_k \,|\, Y_n)$ denote the likelihood for $Y_n$ based on $M_k$. Let $\pi(k)$, $k = 1, \ldots, L$, denote a discrete prior over the models $M_1, \ldots, M_L$. Let $g(\theta_k \,|\, k)$ denote a prior on $\theta_k$ given the model $M_k$. Applying Bayes' Theorem, the joint posterior of $M_k$ and $\theta_k$ can be written as
$$f(k, \theta_k \,|\, Y_n) = \frac{\pi(k)\, g(\theta_k \,|\, k)\, L(\theta_k \,|\, Y_n)}{h(Y_n)},$$
where $h(Y_n)$ denotes the marginal distribution of $Y_n$. The posterior probability on $M_k$ is given by
$$P(k \,|\, Y_n) = \frac{\pi(k) \int L(\theta_k \,|\, Y_n)\, g(\theta_k \,|\, k)\, d\theta_k}{h(Y_n)}. \quad (2.2)$$
Posterior probability on the hypothesis $H_{(a,b)}$ can be found by summing the probabilities on those models for which $\mu_a = \mu_b$. This gives a very reasonable approach to determining the evidence in favor of each of the pairwise equalities.
We remark that placing a continuous prior in $I$ dimensions over $(\mu_1, \ldots, \mu_I)$ will not provide a satisfactory answer to the problem of multiple comparisons testing. Under this approach, $P[H_{(a,b)}] = 0$ both a priori and a posteriori. Thus, the problem of interest is not addressed, since the precise hypotheses of primary focus are rendered impossible.
The integral in (2.2) requires numerical methods or approximation techniques for its computation. Kass and Raftery (1995) provide a discussion of the various alternatives. An attractive option is one based upon the popular Bayesian information criterion (Schwarz, 1978). Define
$$\text{BIC}_k = -2 \log L(\hat{\theta}_k \,|\, Y_n) + d_k \log n,$$
where $\hat{\theta}_k$ denotes the maximum likelihood estimate obtained by maximizing $L(\theta_k \,|\, Y_n)$ over $\Theta(k)$, and $d_k$ denotes the dimension of $\Theta(k)$. It can be shown under certain nonrestrictive regularity conditions (Cavanaugh and Neath, 1999) that
$$P(k \,|\, Y_n) \approx \frac{\exp(-\tfrac{1}{2}\text{BIC}_k)\, \pi(k)}{\sum_{l=1}^{L} \exp(-\tfrac{1}{2}\text{BIC}_l)\, \pi(l)}. \quad (2.3)$$
An outline of the proof is given in the Appendix.
The advantages of computing the posterior model probabilities as in (2.3) include computational simplicity and a direct connection with a popular and well-studied criterion for Bayesian model selection. The justification of approximation (2.3) is asymptotic for the general case of prior $g(\theta_k \,|\, k)$, but Kass and Wasserman (1995) argue that the approximation holds under a noninformative prior on $\theta_k$ even for moderate and small sample sizes. Now, let $\Delta = \Delta(\theta_k)$ denote a parameter of interest. For the multiple comparisons problem, focus may be on the difference between means $\delta_{(a,b)} = \mu_b - \mu_a$, or perhaps on the components of the mean vector $\mu$. The posterior distribution of $\Delta$ given the data $Y_n$ is
$$f(\Delta \,|\, Y_n) = \sum_{k=1}^{L} f(\Delta \,|\, k, Y_n)\, P(k \,|\, Y_n). \quad (2.4)$$
Thus, the posterior of $\Delta$ is found by taking an average of the posterior distributions under each candidate model, weighted by the posterior model probabilities.
The fundamental idea behind Bayesian model averaging (BMA) is to provide a realistic accounting of the uncertainty inherent in selecting a model. Result (2.4) is the foundation of our inferential approach to the multiple comparisons problem.
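As a numerical illustration of approximation (2.3) and the summation over models, the sketch below converts BIC values into posterior model probabilities and then accumulates the evidence for a pairwise equality. The BIC values and model labels are hypothetical placeholders, not figures from the paper.

```python
import math

# Hypothetical BIC values and a uniform model prior, for illustration only.
bics = {"m1=m2, m3=m4, m5": 110.2,   # hypothetical numbers,
        "m1=m2, m3=m4=m5": 112.0,    # not taken from the paper
        "m1=m2, m3, m4, m5": 113.5}
prior = {k: 1.0 / len(bics) for k in bics}

# Approximation (2.3): P(k|Y) is proportional to exp(-BIC_k/2) * pi(k).
# Subtract the minimum BIC before exponentiating for numerical stability.
b0 = min(bics.values())
w = {k: math.exp(-(b - b0) / 2.0) * prior[k] for k, b in bics.items()}
total = sum(w.values())
post = {k: v / total for k, v in w.items()}

# Evidence for H_(3,4): sum posterior probability over models with mu3 = mu4.
p_34 = sum(p for k, p in post.items() if "m3=m4" in k)
print(post, p_34)
```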

An Example
We illustrate the Bayesian approach to multiple comparisons testing through an example involving $I = 5$ population means. The data appear in Montgomery (1997), where the objective is to determine which pairs of means are significantly different using common frequentist methods of multiple comparisons. See Table 1 for the summary statistics and Figure 1 for a graphical display.
Montgomery introduced the data in the context of a completely randomized experiment designed to investigate the relationship between the tensile strength of a new synthetic fiber and the blend of cotton in the fiber. The treatment groups correspond to five different cotton blends. Five fabric specimens are tested for each blend. The response measurements reflect tensile strength (in lb/in²). Treatments are identified in ascending order of the observed sample means.
A glance at the data suggests a potentially strong clustering of $\mu_1, \mu_2$, and a clustering to a lesser degree among $\mu_3, \mu_4, \mu_5$. We shall see how these notions can be quantified through the computation of the Bayesian posterior probabilities on the pairwise equalities. Under the setting of independent sampling with normally distributed error terms, the maximized log-likelihood is derived as
$$\log L(\hat{\theta}_k \,|\, Y_n) = \beta - \frac{n}{2} \log \hat{\sigma}^2(k),$$
where $\beta$ is a constant, $n = \sum_{i=1}^{I} n_i$, and
$$\hat{\sigma}^2(k) = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{n_i} \left( x_{ij} - \hat{\mu}_i(k) \right)^2. \quad (3.1)$$
The Bayesian information for model $M_k$ can be defined as
$$\text{BIC}_k = n \log \hat{\sigma}^2(k) + (I_k + 1) \log n, \quad (3.2)$$
where $I_k$ denotes the number of distinct means (clusters) under $M_k$. A Bayesian model selection rule would favor the model $M_k$ which is a posteriori most probable, or equivalently, the model for which (3.2) is minimum.
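A sketch of this computation follows. The data values are the cotton-blend tensile strengths as the Montgomery (1997) example is commonly reproduced, with groups relabeled in ascending order of sample mean as described above; they are included for illustration only, and the paper's Table 1 remains authoritative.

```python
import math
from itertools import chain

# Tensile-strength data as the cotton-blend example is commonly reproduced;
# treat these values as illustrative, not as a transcription of Table 1.
data = {1: [7, 7, 15, 11, 9],      # 15% cotton, sample mean 9.8
        2: [7, 10, 11, 15, 11],    # 35% cotton, sample mean 10.8
        3: [12, 17, 12, 18, 18],   # 20% cotton, sample mean 15.4
        4: [14, 18, 18, 19, 19],   # 25% cotton, sample mean 17.6
        5: [19, 25, 22, 19, 23]}   # 30% cotton, sample mean 21.6

def bic(clustering):
    """BIC (3.2) for the model whose equal-mean clusters are given,
    e.g. clustering = [[1, 2], [3, 4], [5]]."""
    n = sum(len(v) for v in data.values())
    sse = 0.0
    for cluster in clustering:
        xs = list(chain.from_iterable(data[g] for g in cluster))
        mean = sum(xs) / len(xs)          # pooled MLE for the cluster
        sse += sum((x - mean) ** 2 for x in xs)
    sigma2 = sse / n                      # sigma-hat^2(k) of (3.1)
    return n * math.log(sigma2) + (len(clustering) + 1) * math.log(n)

print(bic([[1, 2], [3, 4], [5]]))        # a strongly supported clustering
print(bic([[1], [2], [3], [4], [5]]))    # all means distinct
```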
The top five model choices and posterior probabilities are displayed in Table 2. A clear selection as "best model" is the clustering $\mu_1 = \mu_2$, $\mu_3 = \mu_4$, $\mu_5$ ($H_{(1,2)}$ true, $H_{(3,4)}$ true). It is worthwhile to note that belief in the most probable model being correct is still rather small (posterior probability less than .5), so the selection of a single model without a corresponding measure of uncertainty would be misleading. Posterior probabilities for the top five most likely pairwise equalities are in Table 3. Posterior pairwise equality probabilities can provide a distinction that their frequentist p-value counterparts cannot. One may fail to reject a null hypothesis of equal means either because the data support the decision of no effect or because there is not enough data to detect an effect. Frequentist p-values alone are not necessarily able to distinguish between the two situations.
The hypothesis $\mu_1 = \mu_2$ is well supported by the data ($P[H_{(1,2)}] \approx .8$), as was suspected. There is also some evidence in favor of $\mu_3 = \mu_4$ ($P[H_{(3,4)}] \approx .6$) and a non-negligible probability of $\mu_4 = \mu_5$ ($P[H_{(4,5)}] > .1$). Yet there is good evidence against $\mu_3 = \mu_5$ ($P[H_{(3,5)}] < .02$). Let us take a closer look at the clustering among $\mu_3, \mu_4, \mu_5$. Tukey's multiple comparison procedure gives a critical range of $Q = 5.37$: a pair of means is deemed not equal if the corresponding sample difference exceeds $Q$ in magnitude. As can be seen from this example, a single clustering is not necessarily obtained. One reaches the decision of accept $\mu_3 = \mu_4$, accept $\mu_4 = \mu_5$, but reject $\mu_3 = \mu_5$. Of course, this paradoxical decision is explained by the fact that "equal" here means only "no statistically significant difference," but the interpretation still lacks any probabilistic detail. The proposed Bayesian approach bridges this gap and provides a coherent presentation for multiple comparisons.
To evaluate the full posterior distribution of $\mu$, a prior must be specified under each model $M_k$. It is our choice to use Jeffreys' noninformative priors for $\{\mu, \sigma^2\}$ given $M_k$. (As mentioned earlier, the choice of a noninformative prior works well in companion with the Bayesian information approximation to model probabilities.) The components of the mean vector then have marginal posterior Student t-distributions:
$$\mu_i \,|\, M_k, Y_n \sim t\left( \hat{\mu}_i(k),\; c_i(k),\; n - I_k \right), \quad (3.3)$$
where $t(m, c, v)$ represents the t-distribution with location parameter $m$, scale parameter $c$, and degrees of freedom $v$. The data-dependent quantities within (3.3) are the sample mean $\hat{\mu}_i(k)$ for the cluster in model $M_k$ containing group $i$, and the scale $c_i(k)$ satisfying
$$c_i^2(k) = \frac{n\, \hat{\sigma}^2(k)}{(n - I_k)\, N_i(k)},$$
where $\hat{\sigma}^2(k)$ is given by (3.1) and $N_i(k)$ is the total sample size of the cluster containing group $i$.
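The conditional posterior (3.3) is easy to work with numerically. The sketch below assumes the scale reconstruction given above, which follows from standard theory under Jeffreys' prior but is not verbatim from the paper; the toy data and function names are ours.

```python
import numpy as np
from scipy import stats

def pooled(data, cluster):
    """Stack all observations for the groups in one equal-mean cluster."""
    return np.concatenate([np.asarray(data[g], float) for g in cluster])

def mu_posterior(data, clustering, i):
    """Conditional posterior (3.3) for mu_i: a Student t centered at the
    pooled mean of i's cluster, with df = n - I_k and the scale above."""
    n = sum(len(v) for v in data.values())
    I_k = len(clustering)
    sse = 0.0
    for c in clustering:
        xs = pooled(data, c)
        sse += np.sum((xs - xs.mean()) ** 2)
    s2 = sse / (n - I_k)                 # = n * sigma-hat^2(k) / (n - I_k)
    own = pooled(data, next(c for c in clustering if i in c))
    return stats.t(df=n - I_k, loc=own.mean(), scale=np.sqrt(s2 / own.size))

# Hypothetical data; 95% credible interval for mu_5 given one clustering.
toy = {1: [7.0, 9.0], 2: [8.0, 11.0], 3: [15.0, 17.0],
       4: [16.0, 19.0], 5: [22.0, 24.0]}
print(mu_posterior(toy, [[1, 2], [3, 4], [5]], 5).interval(0.95))
```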
Conditional on model $M_k$, the posterior distribution of $\delta_{(a,b)} = \mu_b - \mu_a$ takes one of two forms. If the model restriction forces $\mu_a = \mu_b$ (that is, groups $a$ and $b$ are in the same cluster), then a point probability mass is placed at zero. If $\mu_a$ is allowed to differ from $\mu_b$ (groups $a$ and $b$ are in different clusters), then the posterior of $\delta_{(a,b)}$, conditional on model choice, is
$$\delta_{(a,b)} \,|\, M_k, Y_n \sim t\left( \hat{\mu}_b(k) - \hat{\mu}_a(k),\; d_{(a,b)}(k),\; n - I_k \right), \quad (3.4)$$
where the scale satisfies $d_{(a,b)}^2(k) = \frac{n\, \hat{\sigma}^2(k)}{n - I_k} \left( \frac{1}{N_a(k)} + \frac{1}{N_b(k)} \right)$.
Figure 2: BMA posterior distribution for $\delta_{(1,2)}$.
The BMA posterior distribution for $\delta_{(a,b)}$ is the mixture of conditional distributions defined in (2.4). A spike at zero equals the sum of posterior probabilities over models for which $P[\delta_{(a,b)} = 0 \,|\, M_k] = 1$; this sum equals $P[\mu_a = \mu_b]$. The continuous portion is a mixture of the t-distributions in (3.4).
Figures 2 and 3 display the BMA posterior distributions for $\delta_{(1,2)}$ and $\delta_{(4,5)}$. The continuous curve is scaled so that its maximum height equals $P[\mu_a \ne \mu_b]$, so that one can make a direct comparison between the two portions.
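The mixture form lends itself to a short computational sketch: a point mass at zero plus a weighted sum of t densities. All weights, locations, scales, and degrees of freedom below are hypothetical placeholders chosen only to show the mechanics.

```python
import numpy as np
from scipy import stats

# Sketch of the BMA posterior for delta_(a,b): a point mass at zero plus a
# mixture of t densities (3.4).  All numbers here are hypothetical.
spike = 0.79                      # P[mu_a = mu_b]: models forcing equality
parts = [                         # (weight, location, scale, df) per model
    (0.15, 1.1, 0.9, 20),         # hypothetical per-model posteriors
    (0.06, 1.8, 1.0, 21),
]

def bma_density(d):
    """Continuous portion of the BMA posterior, evaluated at d."""
    return sum(w * stats.t.pdf(d, df, loc=m, scale=s)
               for (w, m, s, df) in parts)

grid = np.linspace(-3, 5, 9)
print("P[delta = 0] =", spike)
print("density:", np.round(bma_density(grid), 4))
```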

An Estimation Problem
We switch focus to the problem of estimating the within-group mean vector $\mu$. The most obvious approach to this problem is to simply use the within-group sample mean vector $\hat{\mu}_w = (\bar{x}_1, \ldots, \bar{x}_I)$. At first glance, this seems like a clear choice for an estimator. However, $\hat{\mu}_w$ does not take advantage of any possible relationships between groups. In fact, $\hat{\mu}_w$ is inadmissible as an estimator of $\mu$ under squared error loss when $I \ge 3$ (Stein, 1955). Improvement on $\hat{\mu}_w$ is attained using an estimator whose within-group mean estimates exhibit "shrinkage" toward an overall mean.
To demonstrate the concept of shrinkage, or Stein estimation, consider the following hierarchical approach to the estimation problem at hand. We have
$$\bar{X}_i \,|\, \mu_i \sim N(\mu_i, \sigma^2 / n_i), \quad i = 1, \ldots, I.$$
Suppose
$$\mu_i \sim \text{iid } N(\mu_o, \tau^2), \quad i = 1, \ldots, I.$$
Then the conditional posterior mean of $\mu_i$ is
$$E[\mu_i \,|\, \bar{x}_i] = \left( \frac{\sigma^2 / n_i}{\tau^2 + \sigma^2 / n_i} \right) \mu_o + \left( \frac{\tau^2}{\tau^2 + \sigma^2 / n_i} \right) \bar{x}_i, \quad (4.1)$$
a weighted average of the overall mean $\mu_o$ and the within-group sample mean $\bar{x}_i$. The weights depend upon the between-group variance $\tau^2$ and the within-group variance $\sigma^2 / n_i$. If between-group variability is small relative to within-group variability, the estimates shrink toward the overall mean $\mu_o$. Otherwise, stronger weight is placed on the individual sample means. Expression (4.1) requires estimates of the additional parameters $(\mu_o, \tau^2, \sigma^2)$. Casella (1985) gives an empirical Bayes argument for deriving a point estimate. The estimated weights are functions of the F-statistic for testing equality of all means. If F is small, the data support the hypothesis of equal means and greater weight is placed on the overall mean. As F grows larger, weight shifts toward the individual means, consistent with the information from the data.
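The following sketch implements one standard positive-part empirical Bayes estimator for the balanced case, in the spirit of (but not necessarily identical to) the Casella (1985) estimator; the summary statistics fed to it are illustrative assumptions, not values quoted from the paper.

```python
import numpy as np

def eb_shrinkage(xbar, n_i, mse):
    """Positive-part empirical Bayes (Stein-type) estimate for a balanced
    layout: shrink each sample mean toward the grand mean by a factor
    driven by the F statistic.  One standard form, shown as a sketch."""
    I = len(xbar)
    grand = np.mean(xbar)
    ms_between = n_i * np.sum((xbar - grand) ** 2) / (I - 1)
    F = ms_between / mse
    shrink = min(1.0, ((I - 3) / (I - 1)) / F)  # weight on the grand mean
    return grand + (1.0 - shrink) * (xbar - grand), F

# Hypothetical balanced summary statistics (5 groups, n_i = 5 each);
# mse is an assumed within-group mean square, for illustration only.
xbar = np.array([9.8, 10.8, 15.4, 17.6, 21.6])
est, F = eb_shrinkage(xbar, n_i=5, mse=8.06)
print(np.round(est, 2), round(F, 1))  # large F: only slight shrinkage
```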
A Bayes empirical Bayes approach places a prior on the parameters $(\mu_o, \tau^2, \sigma^2)$. The conditional posteriors for implementing Gibbs sampling are readily attainable (Carlin and Louis, 2000). We return to the data from Table 1. Holding with our theme, a noninformative hyperprior on $(\mu_o, \tau^2, \sigma^2)$ is used. The results are displayed in Table 4. Compare the point estimate $\hat{\mu}_{EB}$ from Table 4 to $\hat{\mu}_w$ from Table 1. The shrinkage effect is evident, but slight. The indication is against equality of all five mean levels.
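Since the paper's exact hyperprior specification is not reproduced above, the Gibbs sampler below assumes one common noninformative choice, $\pi(\mu_o) \propto 1$, $\pi(\sigma^2) \propto 1/\sigma^2$, and $\pi(\tau^2) \propto 1$; the full conditionals follow from that assumption and should be read as a sketch rather than the authors' implementation. The data are the illustrative values used earlier.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs(data, iters=5000):
    """Gibbs sampler for the hierarchical normal model of Section 4 under
    the assumed noninformative hyperprior described above."""
    groups = sorted(data)
    x = {g: np.asarray(data[g], float) for g in groups}
    n = {g: x[g].size for g in groups}
    N, I = sum(n.values()), len(groups)
    mu = np.array([x[g].mean() for g in groups])
    mu_o, tau2, sig2 = mu.mean(), mu.var() + 1e-6, 1.0
    draws = []
    for _ in range(iters):
        # mu_i | rest: precision-weighted normal
        for j, g in enumerate(groups):
            prec = n[g] / sig2 + 1.0 / tau2
            mean = (n[g] * x[g].mean() / sig2 + mu_o / tau2) / prec
            mu[j] = rng.normal(mean, np.sqrt(1.0 / prec))
        # mu_o | rest: normal around the average of the mu_i
        mu_o = rng.normal(mu.mean(), np.sqrt(tau2 / I))
        # sigma^2 | rest: inverse gamma from within-group residuals
        sse = sum(np.sum((x[g] - mu[j]) ** 2) for j, g in enumerate(groups))
        sig2 = sse / (2.0 * rng.gamma(N / 2.0, 1.0))
        # tau^2 | rest: inverse gamma from between-group residuals
        ss = np.sum((mu - mu_o) ** 2)
        tau2 = ss / (2.0 * rng.gamma((I - 2) / 2.0, 1.0))
        draws.append(mu.copy())
    return np.array(draws)

toy = {1: [7, 7, 15, 11, 9], 2: [7, 10, 11, 15, 11], 3: [12, 17, 12, 18, 18],
       4: [14, 18, 18, 19, 19], 5: [19, 25, 22, 19, 23]}
print(gibbs(toy)[1000:].mean(axis=0).round(2))  # posterior means, post burn-in
```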
The idea behind our solution to the estimation problem is that intermediate models exist between the model for which all means are equal and the model for which all means are distinct. In Section 3, we showed how to compute the Bayesian probability on submodels defined by a particular clustering of equal means. An estimate of the mean vector under the BMA framework is given by
$$\hat{\mu}_{BMA} = \sum_{k=1}^{L} \hat{\mu}(k)\, P(k \,|\, Y_n), \quad (4.2)$$
where $\hat{\mu}(k)$ denotes the estimate of $\mu$ under the clustering defined by $M_k$. Hoeting, Madigan, Raftery, and Volinsky (1999) present an overview of BMA inference with applications and optimality results. In particular, BMA is seen to improve estimation and prediction, and to adjust interval estimates which tend to be overconfident if one proceeds as if a selected model is correct with probability one.
For multiple comparisons estimation of the mean vector $\mu$, a type of shrinkage is performed in creating the BMA estimate (4.2). However, this is a more general weighted average than a Stein estimate of the form (4.1). Shrinkage does not have to be toward an overall mean, but rather toward means deemed likely to be equal by the data.
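Putting the pieces together, the end-to-end sketch below computes the BMA estimate (4.2) by enumerating all clusterings and weighting each model's fitted means by its BIC-based posterior probability, assuming a uniform model prior. The data are the same illustrative values used earlier.

```python
import math
import numpy as np

def set_partitions(items):
    """Every partition of `items` into nonempty equal-mean clusters."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller

def fit(data, clustering):
    """Return (BIC, fitted mean for every group) under one clustering."""
    n = sum(len(v) for v in data.values())
    fitted, sse = {}, 0.0
    for c in clustering:
        xs = np.concatenate([np.asarray(data[g], float) for g in c])
        m = xs.mean()
        sse += np.sum((xs - m) ** 2)
        for g in c:
            fitted[g] = m
    bic = n * math.log(sse / n) + (len(clustering) + 1) * math.log(n)
    return bic, np.array([fitted[g] for g in sorted(data)])

data = {1: [7, 7, 15, 11, 9], 2: [7, 10, 11, 15, 11], 3: [12, 17, 12, 18, 18],
        4: [14, 18, 18, 19, 19], 5: [19, 25, 22, 19, 23]}
fits = [fit(data, c) for c in set_partitions(sorted(data))]
b0 = min(b for b, _ in fits)
w = np.array([math.exp(-(b - b0) / 2.0) for b, _ in fits])
w /= w.sum()                                   # posterior model probabilities
means = np.array([m for _, m in fits])
mu_bma = (w[:, None] * means).sum(axis=0)      # the BMA estimate (4.2)
print(np.round(mu_bma, 2))
```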
Again consider the example. Denote the model with all means distinct as $M_w$ and the model with all means equal as $M_o$. We can calculate $\pi(M_o \,|\, Y_n) < .0001$ and $\pi(M_w \,|\, Y_n) = .0554$. The posterior probability of $M_w$ is much greater than that of $M_o$, which provides an alternate justification that the weight of a Stein estimate should be primarily on the individual means. Yet neither posterior probability is large, meaning neither of these models is well supported by the data. The BMA estimate instead shrinks predominantly toward the most probable model, $M^*$: $\mu_1 = \mu_2$, $\mu_3 = \mu_4$, $\mu_5$. Figure 4 provides a graphical comparison of the shrinkage properties of the estimates. One can see the benefits of the BMA estimate in such an example, where the data indicate several potential clusters of equal means.
The full posterior distribution on $\mu_i$ is given by the mixture in (2.4), with distributions conditional on model choice shown in (3.3). Given a particular model choice as correct, the Bayesian intervals under the noninformative prior coincide with the standard frequentist confidence intervals. As a demonstration, we focus attention on estimating $\mu_5$, the treatment level where, on the basis of observed sample means, the maximum expected response occurs. The estimation problem involves the uncertainty of determining which model is correct, and the uncertainty of estimation within a given model. The choice of model $M^*$ as correct implies without question that no clustering of treatment 5 occurs. We see from Table 3, however, that $P[\mu_4 = \mu_5] = .1200$, so the possibility of a clustering with other groups should play a role. Table 5 displays both the BMA intervals and the intervals under model $M^*$.
The BMA intervals are wider, reflecting the uncertainty associated with the selection of a clustering. Figure 5 displays the posterior distributions for $\mu_5$ under $M^*$ and under model averaging. The greater variability under BMA reflects a fuller accounting of uncertainty. Also note the skewness of the BMA posterior, due to the shrinkage property. Bayesian model averaging provides a natural approach to incorporating these desirable aspects into a solution for the estimation problem.
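A BMA credible interval can be obtained by numerically inverting the mixture CDF implied by (2.4). In the sketch below, the component weights, locations, scales, and degrees of freedom are hypothetical stand-ins for the model-specific posteriors; only the mechanics of the interval construction are being demonstrated.

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical mixture components (weight, location, scale, df) for mu_5.
comps = [(0.47, 21.6, 1.3, 22),   # M*: treatment 5 unclustered
         (0.12, 19.9, 1.1, 23),   # a model clustering groups 4 and 5
         (0.41, 21.2, 1.4, 21)]   # remaining models, lumped for illustration

def cdf(x):
    """Mixture CDF of the BMA posterior, per (2.4)."""
    return sum(w * stats.t.cdf(x, df, loc=m, scale=s)
               for (w, m, s, df) in comps)

# Invert the CDF at 2.5% and 97.5% for a central 95% credible interval.
lo = optimize.brentq(lambda x: cdf(x) - 0.025, 0.0, 50.0)
hi = optimize.brentq(lambda x: cdf(x) - 0.975, 0.0, 50.0)
print(round(lo, 2), round(hi, 2))  # wider than any single-model interval
```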

Concluding Remarks
The multiple comparisons problem is well known among statistical practitioners. Although fairly simple to state, a challenge in solving the problem is that one tests several related, precise hypotheses. Bayesian inference has an advantage over traditional frequentist approaches to multiple comparisons testing in that degree of belief is quantified: one can avoid the illogical conclusions which arise from an "accept/reject" decision process. The Bayesian approach in this paper is novel in that the precise hypotheses used to define multiple comparisons testing are the hypotheses actually being tested; Bayesian approaches derived from continuous prior distributions do not possess this characteristic. We are able to compute the probability on the event of equal means, as the statement of the multiple comparisons test requires.

Table 1: Data for the example. Response is tensile strength (lb/in²) by group (cotton blend); sample means and sample standard deviations are reported.

Table 2: Posterior model probabilities

Table 3: Probabilities of pairwise equalities (hypothesis and posterior probability)

Table 4: Interval estimates from Bayes empirical Bayes

Table 5: Interval estimates from BMA and $M^*$