Power of a Rank-Based Test for Differences Between Treatment Distributions in a Randomized Complete Block Design

Friedman’s test is a rank-based procedure that can be used to test for differences among t treatment distributions in a randomized complete block design. It is well known that the test has reasonably good power under location-shift alternatives to the null hypothesis of no difference in the t treatment distributions. However, the power of Friedman’s test when the alternative hypothesis consists of a non-location difference in treatment distributions can be poor. We develop the properties of an alternative rank-based test that has greater power than Friedman’s test in a variety of such circumstances. The test is based on the joint distribution of the t! possible permutations of the treatment ranks within a block (assuming no ties). We show when our proposed test will have greater power than Friedman’s test, and provide results from extensive numerical work comparing the power of the two tests under various configurations for the underlying treatment distributions.


Introduction
In this paper we develop the properties of a nonparametric test for differences among t treatment distributions in a randomized complete block design (RCB) with b blocks. The test statistic is based on the joint distribution of the t! possible orderings of the treatment ranks within a block, and is similar to the nonparametric test proposed by Friedman (1937) in its use of these ranks. Like Friedman's test, the assumptions necessary for the proposed test statistic to have a known, easily computed null distribution are less stringent than those required for the usual analysis of variance F-test. In particular, one need not assume a specific parametric family for the underlying treatment distributions.
Here we develop the test statistic $X^2$ by consideration of several inter-related null hypotheses, and show that the proposed test has better power than Friedman's test for detecting differences in treatment distributions under a variety of conditions.
416 Rank-Based Test for Differences

Hypotheses
For each of t treatments, let $X_{ij}$ denote the response to the j-th treatment in the i-th block, $j = 1, \ldots, t$ and $i = 1, \ldots, b$. Let $\mathbf{X}_i = (X_{i1}, \ldots, X_{it})$ denote the t-vector of responses in the i-th block, assumed to have continuous joint distribution $F_i(\mathbf{x})$. Denote the marginal distribution of $X_{ij}$ by $F_{ij}(x)$.
Consider the null hypothesis that the t components of the vector of responses have identical marginal distributions:
$$H_0^F: F_{i1}(x) = F_{i2}(x) = \cdots = F_{it}(x) \ \text{for all } x, \quad i = 1, \ldots, b; \qquad (1)$$
that is, within each block, observations from different treatment groups have the same distribution function. We call $H_0^F$ the "Friedman null hypothesis." In each block, rank the t responses from 1 to t (smallest to largest), and let $R_{ij}$ denote the rank assigned to the j-th treatment response in the i-th block. Since the cumulative distribution function of each response is assumed to be continuous, it follows that the probability of a tie in rank between two or more treatments in a block is zero.
Following Quade (1984), we assume that blocks are independent and that all blocks have the same joint distribution of treatment ranks. If the only effect of blocks on the response is additive, then this condition will be met.
Letting $\bar{R}_{\cdot j}$ denote the average rank of the j-th treatment across the b blocks, Friedman's test statistic is
$$Q = \frac{12b}{t(t+1)} \sum_{j=1}^{t} \left( \bar{R}_{\cdot j} - \frac{t+1}{2} \right)^2.$$
For small values of t and b, the exact distribution of Q has been tabled (see, for example, Friedman, 1937, or Lehmann, 1975) or may be constructed with the aid of software (van de Wiel, 2004). Friedman (1937) showed that as $b \to \infty$, Q converges in distribution to $\chi^2_{(t-1)}$, a chi-square random variable with $(t-1)$ degrees of freedom. Iman & Davenport (1980) propose the general rule that an asymptotic approximation to the distribution of Q should not be used when $t = 3$, and for $t > 3$ it should be used only when $tb > 9$.
A null hypothesis different from the Friedman null hypothesis $H_0^F$ given in (1) is
$$E(\bar{R}_{\cdot 1}) = E(\bar{R}_{\cdot 2}) = \cdots = E(\bar{R}_{\cdot t}); \qquad (2)$$
that is, the expected rank of the j-th treatment is the same for all j ($j = 1, \ldots, t$). As Friedman's Q sums the squared deviations of the observed average treatment ranks from their expectation under the hypothesis in (2), Q is a direct test of this hypothesis: large values of Q support the complement to (2), that the expected values of the average treatment ranks are not all the same. It is common practice for Friedman's Q to be applied in situations where interest is in testing equivalency of response means across treatments (St. Laurent & Turk, 2013). Even so, it is likely that in many applications, practitioners are unaware of the relationship between Q and the hypothesis in (2).
When the hypothesis in (2) is not true, it can be shown that $H_0^F$ also is not true (see Appendix 1). Thus large values of Friedman's Q can be considered evidence contradicting $H_0^F$. However, if the hypothesis in (2) is true, it is not necessarily the case that $H_0^F$ is true. So failing to reject the hypothesis in (2) via Friedman's test statistic Q means only that there is insufficient evidence to support that there are differences in the expected treatment ranks; it still allows for the possibility that the treatment distributions are not identical. It is in this sense that Q is a direct test of (2) and an indirect test of (1).
It is partially for this reason that we look for a more general, alternate approach to testing the hypothesis $H_0^F$.
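As an illustration of how Q is obtained from raw responses, the following Python sketch ranks the responses within each block and evaluates the usual textbook form of Friedman's statistic, $Q = \frac{12b}{t(t+1)} \sum_j (\bar{R}_{\cdot j} - (t+1)/2)^2$. The function names `within_block_ranks` and `friedman_q` and the two small data sets are hypothetical, and the sketch assumes there are no ties within a block.

```python
def within_block_ranks(block):
    """Rank one block's t responses from 1 (smallest) to t (largest); assumes no ties."""
    order = sorted(range(len(block)), key=lambda j: block[j])
    ranks = [0] * len(block)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def friedman_q(data):
    """Friedman's Q for a b x t table (rows are blocks, columns are treatments)."""
    b, t = len(data), len(data[0])
    ranks = [within_block_ranks(row) for row in data]
    # Average rank of each treatment across the b blocks.
    mean_ranks = [sum(r[j] for r in ranks) / b for j in range(t)]
    centre = (t + 1) / 2  # common expected rank when all expected ranks are equal
    return 12 * b / (t * (t + 1)) * sum((m - centre) ** 2 for m in mean_ranks)


# Hypothetical data: every block orders the treatments the same way.
same_order = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.1, 0.2, 0.3]]
# Hypothetical data: each treatment receives each rank exactly once.
balanced = [[1.0, 2.0, 3.0], [2.0, 3.0, 1.0], [3.0, 1.0, 2.0]]

q_same = friedman_q(same_order)   # attains the maximum b(t - 1)
q_balanced = friedman_q(balanced) # all average ranks equal (t + 1) / 2
```

In the second data set the average ranks are all equal, so Q is zero regardless of which orderings produced them; Q uses the data only through the average ranks.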

Justification
For $i = 1, \ldots, b$ let $\mathbf{R}_i = (R_{i1}, R_{i2}, \ldots, R_{it})$ denote the random vector of ranks for the t treatments in the i-th block. For each i, the multivariate probability distribution of $\mathbf{R}_i$ has support on the set of all possible permutations of the vector of ranks $(1, 2, \ldots, t)$. Denote the k-th such permutation by $\mathbf{r}_k$, and let $p_{ik} = P(\mathbf{R}_i = \mathbf{r}_k)$ be the probability that the vector of t treatment ranks in the i-th block matches the ordering of ranks in the k-th permutation. As we have assumed that the probability distribution of the ranks is identical across blocks, we write $p_k = p_{ik}$ for all i, and $\mathbf{p} = (p_1, \ldots, p_s)$.

The Test Statistic
The random vector $\mathbf{R}_i$ has support on the $s = t!$ permutations $\mathbf{r}_1, \ldots, \mathbf{r}_s$. Let $M_k$ denote the number of blocks, out of b, in which the observed rank vector equals $\mathbf{r}_k$. Then the s-vector $\mathbf{M} = (M_1, \ldots, M_s)$ has a multinomial distribution with b trials and cell probabilities $\mathbf{p}$. If all s orderings are equally likely, then $\mathbf{p} = \mathbf{p}_0 \equiv \frac{1}{s}(1, \ldots, 1)$, in which case $E(M_k) = b/s$ for each k; we denote this hypothesis by $H_0^p$. This suggests using a goodness-of-fit statistic for the s-dimensional multinomial distribution of $\mathbf{M}$ as a method of indirectly testing $H_0^F$, but getting us "closer" to Friedman's hypothesis than the test statistic Q proposed by Friedman.
One possibility is the chi-square goodness-of-fit statistic
$$X^2 = \sum_{k=1}^{s} \frac{(O_k - E_k)^2}{E_k},$$
where $O_k$ and $E_k$ are, respectively, the observed and expected counts in the k-th cell. In our application this simplifies to
$$X^2 = \frac{s}{b} \sum_{k=1}^{s} \left( M_k - \frac{b}{s} \right)^2. \qquad (4)$$
Wormleighton (1959) develops the asymptotic properties of a "hierarchy of tests of permutation symmetry" of a t-variate distribution (as extensions of the familiar sign test for $t = 2$). The author notes that the test statistics Q and $X^2$ can be thought of as being at the two extremes of this hierarchy, with Q being a test of low order symmetry and $X^2$ being a test of high order symmetry. Wormleighton did not explore small sample properties of the $X^2$ test, nor did he consider its power for alternatives to $H_0^p$. Wormleighton's work has subsequently received scant attention in the literature. Quade (1984) briefly mentions $X^2$, including its asymptotic null distribution, but does not focus on the small sample properties of the test.
Rayner & Best (2001, ch. 6) consider the relative merits of Page's test, Friedman's Q, Anderson's test and $X^2$ in testing for differences in treatment distributions in a randomized complete block design, and the relationships between these tests. They note that under the assumption of no difference in treatment distributions ($H_0^F$), each of these test statistics is asymptotically distributed chi-square, with degrees of freedom 1, $(t-1)$, $(t-1)^2$ and $(t!-1)$, respectively. In choosing which test to apply, Rayner & Best (1990) suggest that "…better tests were those whose degrees of freedom matched the dimensions of the alternative hypothesis."
Based on standard results concerning the chi-square goodness-of-fit test in multinomial sampling, when $H_0^p$ is true (all $s = t!$ orderings of the ranks equally likely), $E(X^2) = s - 1$, and for fixed s, as $b \to \infty$, the statistic $X^2$ converges in distribution to $\chi^2_{(s-1)}$, a chi-square random variable with $(s-1)$ degrees of freedom (Pearson, 1900).
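A minimal Python sketch of the permutation-count statistic may help fix ideas. It assumes the equal-probability null, so that every one of the $s = t!$ orderings has expected count $b/s$, tallies the within-block rank vectors, and evaluates the Pearson goodness-of-fit sum. The names `rank_vector` and `x2_statistic` and the data are hypothetical, and ties within a block are assumed absent.

```python
from collections import Counter
from math import factorial


def rank_vector(block):
    """Within-block ranks (1 = smallest) as a hashable tuple; assumes no ties."""
    order = sorted(range(len(block)), key=lambda j: block[j])
    ranks = [0] * len(block)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return tuple(ranks)


def x2_statistic(data):
    """Pearson goodness-of-fit statistic on the counts of observed rank orderings."""
    b, t = len(data), len(data[0])
    s = factorial(t)
    expected = b / s  # expected count of every ordering under the uniform null
    counts = Counter(rank_vector(row) for row in data)
    observed_part = sum((m - expected) ** 2 / expected for m in counts.values())
    # Each ordering that never occurs contributes (0 - expected)^2 / expected.
    unobserved_part = (s - len(counts)) * expected
    return observed_part + unobserved_part


# Hypothetical data: every block shows the same ordering of treatments.
same_order = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.1, 0.2, 0.3]]
# Hypothetical data: three different orderings, one block each.
balanced = [[1.0, 2.0, 3.0], [2.0, 3.0, 1.0], [3.0, 1.0, 2.0]]

x2_same = x2_statistic(same_order)
x2_balanced = x2_statistic(balanced)
```

Unlike a statistic built from average ranks, this one depends on which orderings occurred and how often, which is the sense in which it uses the full joint distribution of the ranks.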
In many if not most applications, we can expect that b will be small, possibly quite small relative to the corresponding rules of thumb for suggested use of the asymptotic chi-square reference distribution. By one such rule of thumb (Koehler & Larntz, 1980), under the uniform null hypothesis, each expected cell count $b/s$ should be greater than $\sqrt{10/s}$, or equivalently, $b > \sqrt{10s} = \sqrt{10\,t!}$, provided that $b \geq 10$. For $t = 3$ this suggests an experiment with at least 10 blocks; for $t = 4$, at least 16 blocks; and for $t = 6$, at least 85 blocks. While this requirement on the number of blocks is not unrealistically large for small t, it is useful to consider the exact, small sample distribution of these test statistics, via complete enumeration or simulation, particularly when b is small and t is greater than 4.
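The block counts quoted for this rule of thumb can be checked with a few lines of Python. The sketch reads the rule as "b strictly greater than the square root of 10 times t!, and b at least 10"; the function name `min_blocks` is hypothetical.

```python
from math import factorial, isqrt


def min_blocks(t):
    """Smallest integer b with b > sqrt(10 * t!) and b >= 10."""
    s = factorial(t)  # number of possible orderings of t ranks
    # isqrt floors the square root, so adding 1 gives the least integer
    # strictly greater than sqrt(10 * s), even when 10 * s is a perfect square.
    return max(10, isqrt(10 * s) + 1)


required = {t: min_blocks(t) for t in (3, 4, 6)}
```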

Enumeration of the Exact Sampling Distribution of the Test Statistic
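The small sample properties, including the exact distribution, of both $X^2$ and Q depend entirely on t, b and the multinomial vector of probabilities $\mathbf{p}$.
Starting from (4), the goodness-of-fit statistic simplifies to
$$X^2 = \frac{s}{b} \sum_{k=1}^{s} M_k^2 - b,$$
where $M_k$ is the number of blocks whose rank vector equals the k-th of the $s = t!$ possible permutations.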
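For very small designs, complete enumeration can be sketched directly. The Python code below lists every way of distributing b blocks over the $s = t!$ orderings, attaches the multinomial probability under the uniform null, and accumulates the exact null distribution of the statistic written as $\frac{s}{b}\sum_k M_k^2 - b$. This is an illustrative sketch rather than the procedure used in the study; the function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from collections import defaultdict
from fractions import Fraction
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_null_distribution(t, b):
    """Exact distribution of X^2 when all s = t! orderings are equally likely."""
    s = factorial(t)
    dist = defaultdict(Fraction)
    for counts in compositions(b, s):
        # Multinomial coefficient b! / (m_1! ... m_s!); each quotient is an integer.
        coef = factorial(b)
        for m in counts:
            coef //= factorial(m)
        x2 = Fraction(s, b) * sum(m * m for m in counts) - b
        dist[x2] += Fraction(coef, s ** b)
    return dict(sorted(dist.items()))


# Smallest design considered in the text: t = 3 treatments, b = 5 blocks.
null_dist = exact_null_distribution(3, 5)
```

The number of count vectors is $\binom{b+s-1}{s-1}$, which is 252 for $t = 3$ and $b = 5$ but grows very quickly with t, so this direct approach is only practical for small designs.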

Power Comparisons
Rayner & Best (2001, pp. 97-100) report a simulation study comparing the power of four tests including Q and $X^2$ to detect a location shift between treatments in a complete block design with normal errors. They used a randomized test approach to ensure that each test had size $\alpha = 0.05$. For $t = 3$ and 4 treatments, $b = 5$ and 10 blocks, and two patterns of location-shift for each treatment, 10,000 simulations were run. Their results show that Friedman's test has greater power than $X^2$ for detecting location shift between treatment distributions. They note similar results were obtained with uniform and double exponential error distributions. They did not consider non-location differences between treatments in their study.
In what follows, we compare the power of $X^2$ to the power of Friedman's Q for plausible location and non-location alternatives to identical treatment distributions based on the exact (small-sample) distribution of the test statistics under both the null and various alternative hypotheses under consideration. We also include the RCB analysis of variance F-test for differences in treatment means in our comparisons as a benchmark, as it has certain well-understood optimality properties for detecting location shifts when the treatment distributions are normal. Note that the F-test requires measurements on a continuous scale, while both the Q and $X^2$ tests require only the relative rankings of the observations in each block.

Design of Study
We looked at $t = 3$, 4 and 6 treatments for each of b = 5, 10, 20 and 40 blocks. To compare the power of the Q, $X^2$ and F tests to detect differences in treatment distributions, we chose examples in which treatment distributions differ in location (median or mean) only, in scale only, or in both location and scale. We considered both symmetric and skew treatment distributions and assumed additive block effects, which, without loss of generality, were taken to be zero.
The scenarios used in this study are listed in Table 1.

Type I Error Rates
Because of the discrete nature of the exact null distributions of Q and $X^2$, for any fixed nominal significance level $0 < \alpha < 1$ it is generally not possible to find critical values for both Q and $X^2$ that yield tests of size precisely equal to $\alpha$. This is especially problematic when b is small. However, it is difficult to compare the power of two tests that are not of the same size.
For this reason, rather than fix $\alpha$ at 0.01, 0.05 or some other value, for each combination of t and b, we used the exact or estimated small-sample null distribution of each test statistic to find critical values that would result in comparable and reasonable size tests. The exact and estimated nominal Type I error rates obtained from this process are summarized in Table 2.
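The effect of discreteness on test size can be seen in a toy case. The Python sketch below takes an exact null distribution (a mapping from attainable values of a statistic to their probabilities), lists the sizes attainable by rejecting when the statistic is at least some critical value, and picks the critical value whose size is closest to a target. The helper names, the "closest size" criterion, and the toy design ($t = 2$, $b = 5$, which lies outside the designs studied here) are assumptions made only for illustration.

```python
from fractions import Fraction
from math import comb


def attainable_sizes(null_dist):
    """Map each candidate critical value c to the exact size P(statistic >= c)."""
    sizes = {}
    tail = Fraction(0)
    for value in sorted(null_dist, reverse=True):
        tail += null_dist[value]
        sizes[value] = tail
    return sizes


def closest_critical_value(null_dist, target):
    """Return (critical value, size) whose exact size is nearest to `target`."""
    sizes = attainable_sizes(null_dist)
    return min(sizes.items(), key=lambda item: abs(item[1] - target))


# Toy null distribution of X^2 with s = 2 equally likely orderings and b = 5
# blocks, where the count of the first ordering is Binomial(5, 1/2).
b, s = 5, 2
toy_null = {}
for m1 in range(b + 1):
    x2 = Fraction(s, b) * (m1 ** 2 + (b - m1) ** 2) - b
    toy_null[x2] = toy_null.get(x2, Fraction(0)) + Fraction(comb(b, m1), s ** b)

sizes = attainable_sizes(toy_null)
critical, size = closest_critical_value(toy_null, Fraction(1, 20))
```

In this toy case only three sizes are attainable, none of them equal to 0.05, which is the difficulty that motivates matching the sizes of the competing tests rather than fixing a nominal level in advance.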

Calculation of Power
The exact distribution of both Q and $X^2$, and hence the power of these tests, depends upon the treatments only through the vector $\mathbf{p}$ of multinomial probabilities associated with each possible ordering of the treatment ranks under the alternative. In turn, for each scenario under consideration $\mathbf{p}$ depends on a parameter $\delta$. With this in mind, it is sometimes convenient to write $1 - \beta_Q(\mathbf{p}(\delta))$ and $1 - \beta_{X^2}(\mathbf{p}(\delta))$ as the power of the respective tests for a given vector $\mathbf{p}(\delta)$.
Each scenario includes values of $\delta$ for which $H_0^p$ is not true (see Table 1). In these instances the elements of $\mathbf{p}$ were calculated via exact or numerical integration using MAPLE 15.
In all cases, whether the non-null distribution of Q was calculated exactly or estimated via simulation, the exact or estimated power $1 - \beta_Q(\mathbf{p}(\delta))$ was taken to be the probability, under the alternative, that Q falls in the rejection region determined from its null distribution.
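Because power depends on the treatments only through the vector of ordering probabilities, it can be computed exactly for small designs once that vector is known. The Python sketch below sums multinomial probabilities over all count vectors whose statistic, written as $\frac{s}{b}\sum_k M_k^2 - b$, is at least a given critical value. The function names, the rejection rule "statistic greater than or equal to the critical value", and the probability vectors are illustrative assumptions, not values from the study.

```python
from fractions import Fraction
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_power(p, b, critical_value):
    """P(X^2 >= critical_value) when the s orderings have probabilities p."""
    s = len(p)
    power = Fraction(0)
    for counts in compositions(b, s):
        x2 = Fraction(s, b) * sum(m * m for m in counts) - b
        if x2 >= critical_value:
            # Multinomial probability b! * prod(p_k^m_k / m_k!) of this count vector.
            prob = Fraction(factorial(b))
            for m, pk in zip(counts, p):
                prob *= pk ** m / factorial(m)
            power += prob
    return power


uniform = [Fraction(1, 6)] * 6                      # null: all orderings equally likely
one_ordering = [Fraction(1)] + [Fraction(0)] * 5    # hypothetical extreme alternative

size_at_25 = exact_power(uniform, 5, 25)
power_at_25 = exact_power(one_ordering, 5, 25)
```

Evaluating the same function at the uniform vector gives the exact size of the test, so size and power come from one calculation.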

Results
For the sake of brevity, graphs of the results for $t = 4$ are not given here, but are discussed.

Location Shift
Here the departure from identical treatment distributions is due solely to a location shift in one of the treatment distributions. While the method of construction of these tests ensures that, for each value of t and b, the achieved Type I error rate (when $\delta = 0$) for Q and $X^2$ matches the nominal values in Table 2, this is not the case for the F test. The estimated Type I error rate for the F test is consistently low, ranging from 55% of its nominal value when $t = 3$ and $b = 5$ (0.0122 versus 0.0220), to 74% of its nominal value when $b = 20$ for $t = 3$, 4 or 6 (e.g., for $t = 3$: 0.0279 versus 0.0377). This is perhaps not surprising given that the $t_2$ distribution used here is heavy-tailed and the F test is designed to detect location differences between normal treatment distributions. Surprisingly, when $b = 5$ the F test is as powerful ($t = 3, 4$) or more powerful ($t = 6$) than Q in detecting a location shift in treatment distributions, even though the F test is conservative here.
Results (not shown here) were also obtained using a normal location family in place of the $t_2$ location family, for the same values of t and b. As one might expect, the F test had the greatest power to detect a location shift, followed by Friedman's Q, and then $X^2$. The relative power of $X^2$ to Q in the normal location family was very similar to the results discussed above.
From these examples, we conclude that the $X^2$ goodness-of-fit test, while more powerful than F for detecting location differences among heavy-tailed treatment distributions when $t = 3$ and $b = 20$ or 40, does not do as well as Q. Excepting very small sample sizes ($b = 5$), Q outperforms the F test for detecting location differences in heavy-tailed distributions (including other examples examined but not reported here), but does less well for light-tailed distributions. With respect to the relative behavior of the F and Q tests, this is consistent with the results of O'Gorman (2001), though his results do not include symmetric distributions as heavy-tailed as the $t_2$, and he did not include $X^2$ in his study.

Scale Shift
Scenario 2 allows comparison of the power of Q, $X^2$ and F to detect a scale shift in one of the t treatment distributions, using a normal family, as described in section 4.1. The graphs in Figure 2 give the results based on power calculations for a shift in scale by $\sigma$.
If one knew to expect that any potential differences among the treatments would be due to differences in scale, then one of several common tests might be used to look for treatment differences, including Hartley's test or the Brown-Forsythe test. Here, we imagine the practitioner using Q, $X^2$ and F to look for differences among treatments not knowing what the nature of the difference might be. In addition, the results here will be helpful in understanding the ability of the three tests under consideration to detect departures (from $H_0^F$) that involve both location and scale shifts.
As one would expect, the estimated size of the F test under this scenario is within simulation error of its nominal value for all t and b.

None of the three tests has great power to detect a small-to-moderate shift in scale among the treatments.
Graphs of the results appear in Figure 4. Similar to Scenario 3, there is no value of $\delta$ here for which $H_0^F$ is true. When $\delta = 0$, the only difference among the treatment distributions is a scale shift. This is precisely the situation noted by Friedman (1937, second paragraph of footnote 4 on page 678) as requiring "further analysis." Implicitly, it seems that he recognized there might be difficulties in detecting differences in location with his test in the presence of non-constant variance across treatments.
With the exception of $t = 6$ and $b = 5$, the F test is not a serious competitor for detecting location shift among the treatments in the presence of a fixed scale shift in one treatment distribution. The behavior of Q is reasonably consistent across t as b increases. When $\delta = 0$ and the only difference among the treatment distributions is a scale shift, Q has relatively poor power to detect that shift; however, its power increases as $\delta$ increases, consistent with the results seen in scenario 1. Compared with Q, $X^2$ has greater power to detect a scale shift, and greater or equivalent power for all location shifts for larger b. In essence, $X^2$ "already" has power to detect a difference among treatments due to scale shift when $\delta = 0$ (scenario 2), and this power increases as $\delta$ increases, at least up to a point. In this setting $X^2$ is to be preferred to Q on the basis of its greater power to detect small location shifts and equivalent power to detect large location shifts.
In Scenario 5 we consider sampling from a distribution with fixed non-zero skew, where the treatments vary in scale and location.
As described in section 4.1, scenario 5 compares the power of each test to detect a difference among t treatment distributions when $(t-1)$ of them are distributed median-centered exponential. For fixed b, the power curves for Q and F behave similarly.
In some of the scenarios considered, two or more of the treatment distribution functions are stochastically ordered; in the remaining scenarios none of the alternatives are stochastically ordered. Developing a better understanding of the small-sample behavior of Friedman's test and the $X^2$ test when the alternatives under consideration do, or do not, exhibit stochastic ordering is the subject of future work.
Aligned Ranks Test. The aligned ranks test proposed by Hodges & Lehmann (1962) and further developed by Sen (1968) is an alternative nonparametric test applicable in the randomized complete block setting that has good power relative to Friedman's test for detecting location shifts (O'Gorman, 2001). Comparing the $X^2$ test to the aligned ranks test in location and non-location shift settings is an avenue for future study.
In this study we considered only alternative distributions for the treatments that differed additively between blocks. An interesting question beyond the scope of this study would be to consider alternatives that incorporated variable treatment effects.
Alternative Approaches. Within the context of a rank-based methodology, we have focused on one approach to testing $H_0^F$ via a test of $H_0^p$ in the multinomial setting using $X^2$. Here one could also consider the class of power-divergence statistics for evaluating departures from $H_0^p$ based on ranks. This class includes $X^2$ and the likelihood ratio test statistic $G^2$ as special cases (Cressie & Read, 1984; Read & Cressie, 1988).
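To make the connection concrete, the power-divergence family is commonly written as $\frac{2}{\lambda(\lambda+1)} \sum_k O_k \left[ (O_k/E_k)^{\lambda} - 1 \right]$, where $\lambda = 1$ gives Pearson's $X^2$ and the limit $\lambda \to 0$ gives $G^2$. The Python sketch below implements this form for $\lambda \geq 0$ only; the function name and the example counts (five blocks spread over six orderings) are hypothetical.

```python
from math import log


def power_divergence(observed, expected, lam):
    """Power-divergence statistic for lam >= 0; lam = 0 is handled as the G^2 limit."""
    if lam == 0:
        # Likelihood ratio statistic; cells with zero count contribute nothing.
        return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))
    return 2.0 * total / (lam * (lam + 1.0))


# Hypothetical counts of the s = 6 orderings for t = 3 treatments in b = 5 blocks.
counts = [3, 1, 1, 0, 0, 0]
expected = [5 / 6] * 6  # uniform null: b / s in every cell

pearson = power_divergence(counts, expected, 1.0)
g2 = power_divergence(counts, expected, 0.0)
near_g2 = power_divergence(counts, expected, 1e-6)
```

Other values of $\lambda$ give other members of the family, so the same enumeration of count vectors used for $X^2$ could in principle be reused to study their small-sample behavior.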