Training Students and Researchers in Bayesian Methods

Frequentist Null Hypothesis Significance Testing (NHST) is such an integral part of scientists' behavior that its use cannot be discontinued by flinging it out of the window. Faced with this situation, the suggested strategy for training students and researchers in statistical inference methods for experimental data analysis involves a smooth transition towards the Bayesian paradigm. Its general outlines are as follows. (1) To present natural Bayesian interpretations of NHST outcomes and so draw attention to their shortcomings. (2) To create, as a result, the need for a change of emphasis in the presentation and interpretation of results. (3) Finally, to equip users with a real possibility of thinking sensibly about statistical inference problems and behaving in a more reasonable manner. The conclusion is that teaching the Bayesian approach in the context of experimental data analysis appears both desirable and feasible. This feasibility is illustrated for analysis of variance methods.


Introduction
Today is a crucial time because we are in the process of defining new publication norms for experimental research. In psychology the necessity of changes in reporting experimental results has recently been made official by the American Psychological Association (Wilkinson et al., 1999; American Psychological Association, 2001). In all experimental fields, and especially in medical research, this necessity is supported more and more by journal editors who require authors to routinely report effect size indicators and their interval estimates, in addition to or in place of the results of traditional Null Hypothesis Significance Testing (NHST).
The present paper is divided into four sections. (1) I argue that NHST is an inadequate method for experimental data analysis, not because it is an incorrect normative model, but because it does not address the questions that scientific research requires. I present and criticize the recommendations proposed by the Task Force of the American Psychological Association to overcome this inadequacy. (2) As an alternative, I suggest teaching Bayesian methods as a therapy against the misuses and abuses of NHST. (3) The feasibility of this teaching is illustrated in the context of analysis of variance methods. (4) Its advantages and difficulties are discussed. In conclusion, training students and researchers in Bayesian methods should become an attractive challenge for statistics instructors.

The stranglehold of null hypothesis significance tests
From the outset (Boring, 1919; Tyler, 1931; Berkson, 1938; etc.), NHST has been subject to intense criticism, both on theoretical and methodological grounds, not to mention the sharp controversy that opposed Fisher to Neyman and Pearson on the very foundations of statistical inference. Criticism intensified in the sixties, especially in the behavioral and social sciences (see especially Morrison and Henkel, 1970). The fundamental inadequacy of NHST in experimental data analysis has been denounced by the most eminent and most experienced scientists (see Poitevineau, 1998; Lecoutre, Lecoutre and Poitevineau, 2001).
Several empirical studies emphasized the widespread existence of common misinterpretations of NHST among students and psychological researchers (Rosenthal and Gaito, 1963; Nelson, Rosenthal and Rosnow, 1986; Oakes, 1986; Zuckerman, Hodgins, Zuckerman and Rosenthal, 1993; Falk and Greenbaum, 1995; Mittag and Thompson, 2000; Gordon, 2001; Poitevineau and Lecoutre, 2001). Recently, Haller and Krauss (2002) found that most methodology instructors who teach statistics to psychology students, including professors who work in the area of statistics, share their students' misinterpretations. Furthermore, Lecoutre, Poitevineau and Lecoutre (2003) showed that professional applied statisticians from pharmaceutical companies are not immune to misinterpretations of NHST, especially if the test is nonsignificant.
If some of the above results could be interpreted as an individual's lack of mastery, this explanation is hardly applicable to professional statisticians. More likely these results reveal that NHST does not address the questions that scientific research requires. Thus, users must resort to a more or less "naïve" mixture of NHST results and other information. In other words they must make "judgmental adjustments" (Bakan, 1966; Phillips, 1973, p. 334) or "adaptive distortions" (M.-P. Lecoutre, 2000, p. 74) designed to make an ill-suited tool fit their true needs. The confusion between statistical significance and scientific significance ("the more significant a result is, the more scientifically interesting it is, and/or the larger the true effect is") illustrates such an adjustment and can be seen as an adaptive abuse. The improper use of nonsignificant results as "proof of the null hypothesis" is even more illustrative; indeed, faced with a nonsignificant result, users seem to have no other choice but to either interpret it as proof of the null hypothesis or attempt to justify it by citing an anomaly in the experimental conditions or in the sample. Also the "incorrect" interpretations of p-values as "inverse" probabilities (1-p is "the probability that the alternative hypothesis is true" or is considered as "evidence of the replicability of the result"), even by experienced users, reveal questions that are of primary interest for the users. Such interpretations suggest that "users really want to make a different kind of inference" (Robinson and Wainer, 2002, p. 270).

Moreover, many psychology researchers explicitly state that they are dissatisfied with current practices and appear to have a real consciousness of the stranglehold of NHST (M.-P. Lecoutre, 2000). They use significance tests only because they know no other alternative, but they express the need for inferential methods that would be better suited for answering their specific questions. In this context a consensus consists in expecting the statistical analysis to express in an objective way "what the data have to say" independently of any outside information. Indeed very few researchers state that they want to integrate outside information, notably theoretical background, into the statistical analysis of data.

Time for change in teaching statistical inference methods
These findings encourage the many recent attempts to improve the habitual ways of analyzing and reporting experimental data. We can expect with Kirk (2001, p. 217) that these attempts "will set off a chain reaction" and in particular that "teachers of statistics, methodology, and measurement courses will change their courses" and that "faculties will require students to learn the full arsenal of quantitative and qualitative statistical tools". We cannot accept that future users of statistical inference methods will continue using inappropriate procedures "because they know no other alternative".
So the time has come to create a shift of emphasis in the teaching of statistical inference methods, even in introductory courses for non-statistician students. A more and more widespread opinion is that inferential procedures that bypass the common misuses of significance tests while providing genuine information about the size of effects must be taught in addition to (or even instead of) NHST. For this purpose, confidence intervals, likelihood, or Bayesian methods are clearly appropriate (e.g., Goodman and Berlin, 1994; Nester, 1996; Rouanet, 1996). Today, the majority trend is to advocate the use of confidence intervals. The following extracts are guidelines proposed by the Task Force of the American Psychological Association (Wilkinson et al., 1999) for revising the statistical section of the American Psychological Association Publication Manual (italics are mine).

Hypothesis tests. "It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval. Never use the unfortunate expression 'accept the null hypothesis.' Always provide some effect-size estimate when reporting a p value."

Interval estimates. "Interval estimates should be given for any effect sizes involving principal outcomes. Provide intervals for correlations and other coefficients of association or variation whenever possible."

Effect sizes. "Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure."

Power and sample size. "Provide information on sample size and the process that led to sample size decisions. Document the effect sizes, sampling and measurement assumptions, as well as analytic procedures used in power calculations. Because power computations are most meaningful when done before data are collected and examined, it is important to show how effect-size estimates have been derived from previous research and theory in order to dispel suspicions that they might have been taken from data used in the study or, even worse, constructed to justify a particular sample size."

Further difficulties
"It would not be scientifically sound to justify a procedure by frequentist arguments and to interpret it in Bayesian terms" (Rouanet, 2000, in Rouanet et al., p. 54).
Confidence intervals could quickly become a compulsory norm in experimental publications. However, for many reasons due to their frequentist conception, confidence intervals can hardly be viewed as the ultimate method. Indeed the appealing feature of confidence intervals is the result of a fundamental misunderstanding. As is the case with significance tests, the frequentist interpretation of a 95% confidence interval involves a long run repetition of the same experiment: in the long run 95% of computed confidence intervals will contain the "true value" of the parameter; each interval in isolation has either a 0 or 100% probability of containing it. It is so strange to treat the data as random even after observation that the orthodox frequentist interpretation of confidence intervals does not make sense for most users. It is undoubtedly the natural (Bayesian) interpretation of confidence intervals, in terms of "a fixed interval having a 95% chance of including the true value of interest", which is their appealing feature.
Even experts in statistics are not immune from conceptual confusions about frequentist confidence intervals. For instance, Rosnow and Rosenthal (1996, p. 336) take the example of an observed difference between two means d = +0.266. They consider the interval [0, +0.532] whose bounds are the "null hypothesis" value (0) and what they call the "counternull value" (2d = +0.532), computed as the symmetrical value of 0 with regard to d. They interpret this specific interval [0, +0.532] as "a 77% confidence interval" (0.77 = 1 − 2 × 0.115, where 0.115 is the one-sided p-value for the usual t test). If we repeat the experiment, the counternull value and the p-value will be different, and, in a long run repetition, the proportion of null-counternull intervals that contain the true value of the difference δ will not be 77%. Clearly, 0.77 is here a data dependent probability, which needs a Bayesian approach to be correctly interpreted.
Beyond these difficulties with frequentist confidence intervals, the proposed guidelines are both partially technically redundant and conceptually incoherent. Just as NHST, they risk resulting in the teaching of a set of recipes and rituals (power computations, p-values, confidence intervals...), without supplying real statistical thinking. In particular, one may fear that students (and their teachers) will continue to focus on the statistical significance of the result (only wondering whether the confidence interval includes the null hypothesis value) rather than on the full implications of confidence intervals. As the authors of these guidelines state, it is probably true that "statistical methods should guide and discipline our thinking but should not determine it." However it is no less true that it would be the "folly of blindly adhering to a ritualized procedure" (Kirk, 2001, p. 207).

The Bayesian Alternative
We then naturally have to ask ourselves whether the "Bayesian Choice" will not, sooner or later, be unavoidable (Lecoutre, Lecoutre and Poitevineau, 2001).

What is Bayesian inference for experimental data analysis?
"But the primary aim of a scientific experiment is not to precipitate decisions, but to make an appropriate adjustment in the degree to which one accepts, or believes, the hypothesis or hypotheses being tested" (Rozeboom, 1960).
For the statistician, the role of probabilities, and thus the debates between "frequentists" and "Bayesians", can be expressed in these terms (Lindley, 1993): "whether the probabilities should only refer to data and be based on frequency or whether they should also apply to hypotheses and be regarded as measures of beliefs" (italics added). Bayesian inference, based on a more general and more useful working definition of probability, can address directly problems that the frequentist approach can only address indirectly by resorting to arbitrary tricks.
The most common criticism of the Bayesian approach by frequentists is the need for prior probabilities. Many Bayesians place emphasis on a subjective perspective. An extreme view is that of Savage (1954), who claimed his intention to incorporate prior opinions, not only prior knowledge, into scientific inference. Moreover, by their insistence on the decision-theoretic elements of the Bayesian approach, many authors have obscured the contribution of Bayesian inference to experimental data analysis and scientific reporting. This may be one reason why until now scientists have been reluctant to use Bayesian inferential procedures in practice for analysing their data.
Without dismissing the merits of the decision-theoretic viewpoint, it must be recognized that there is another approach, just as Bayesian, which was developed by Jeffreys in the thirties (Jeffreys, 1939/1998). Following the lead of Laplace (1825/1986), this approach aimed at assigning the prior probability when "nothing" was known about the value of the parameter. In practice, these noninformative prior probabilities are vague distributions which, a priori, do not favor any particular value. Consequently they let the data "speak for themselves" (Box and Tiao, 1973, p. 2). In this form the Bayesian paradigm provides, if not objective methods, at least reference methods appropriate for situations involving scientific reporting. This approach to Bayesian inference is now recognized as a standard: "We should indeed argue that noninformative prior Bayesian analysis is the single most powerful method of statistical analysis" (Berger, 1985, p. 90).

Routine Bayesian methods for experimental data analysis
For more than twenty-five years now, with other colleagues in France I have worked to develop routine Bayesian methods for the most familiar situations encountered in experimental data analysis (see e.g., Rouanet and Lecoutre, 1983; Lecoutre, Derzko and Grouin, 1995; Lecoutre, 1996; Lecoutre and Charron, 2000; Lecoutre and Poitevineau, 2000; Lecoutre and Derzko, 2001). These methods can be used and taught as easily as the t, F or chi-square tests. We have argued that they offer promising new ways in statistical methodology (Rouanet et al., 2000).
We have especially developed "noninformative methods". In order to promote them, it seemed important to us to give them a more explicit name than "standard", "noninformative" or "reference". We proposed to call them fiducial Bayesian (B. Lecoutre, 2000). This deliberately provocative name pays tribute to Fisher's work on scientific inference for research workers (Fisher, 1925/1990). It indicates their specificity and their aim to let the statistical analysis express what the data have to say independently of any outside information. Fiducial Bayesian methods are concrete proposals to bypass the inadequacy of NHST. They have been applied many times to real data and have been well accepted by experimental journals (see e.g., Hoc and Leplat, 1983; Ciancia et al., 1988; Lecoutre, 1992; Desperati and Stucchi, 1995; Hoc, 1996; Amorim and Stucchi, 1997; Amorim et al., 1997; Clément and Richard, 1997; Amorim et al., 1998; Amorim et al., 2000; Lecoutre et al., 2003, 2004; and many experimental articles published in French).

The desirability of Bayesian methods
Clearly, the Bayesian approach offers more flexibility to experimental data analysis. In order to illustrate its advantages, I will consider the pharmaceutical example used by Student (1908) in his original article on the t test. Given, for each of the n = 10 patients, the two "additional hour's sleep" gained by the use of two soporifics [1 and 2], Student used his t test for an inference about the difference of means between the two soporifics, "by making a new series, subtracting 1 from 2" (the ten individual differences are given in Table 1). Then he computed the mean +1.58 [d] and the (uncorrected) standard deviation 1.17 [hence s = 1.23, corrected for df] of this series, and concluded from his table of the "t distribution" that "the probability is .9985, or the odds are about 666 to 1 that 2 is the better soporific" (which is not an orthodox frequentist formulation!). In modern terms, we compute the t test statistic for the inference about a normal mean, t = +1.58/(1.23/√10) = +4.06, and we find the one-sided p-value 0.0014 (9 df). Some features, outlined hereafter, illustrate the desirability of Bayesian methods as an alternative to the Task Force guidelines.
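Student's computation is easy to reproduce from the summary statistics quoted above (d = +1.58, s = 1.23, n = 10). The following sketch uses scipy and only these published summaries, not the raw Table 1 data:

```python
from math import sqrt
from scipy import stats

n = 10     # patients
d = 1.58   # observed mean difference (additional hours of sleep)
s = 1.23   # standard deviation corrected for df (9 df)

e = s / sqrt(n)                             # standard error, about 0.39
t_stat = d / e                              # t test statistic, about +4.06
p_one_sided = stats.t.sf(t_stat, df=n - 1)  # one-sided p-value, about 0.0014
```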

Hypothesis tests: Fiducial Bayesian interpretation of p-values.
Fiducial Bayesian inference provides insightful interpretations of frequentist procedures in intuitively appealing and readily interpretable forms, using the natural language of Bayesian probability. For instance, the one-sided p-value of the t test is exactly the fiducial Bayesian probability that the true difference δ has the opposite sign of the observed difference. Given Student's data (p = 0.0014, one-sided), there is a 0.14% posterior probability of a negative difference and a 99.86% complementary probability of a positive difference. In the Bayesian framework these statements are statistically correct.
Moreover the fiducial Bayesian interpretation of p-values clearly points out the methodological shortcomings of NHST. It becomes apparent that the p-value in itself says nothing about the magnitude of δ. On the one hand, even a "highly significant" outcome (p "very small") only establishes that δ has the same sign as the observed difference d. On the other hand, a "nonsignificant" outcome is hardly worth anything, as exemplified by the fiducial Bayesian interpretation Pr(δ < 0) = Pr(δ > 0) = 1/2 of a "perfectly nonsignificant" test (i.e. d = 0).
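The equivalence between the one-sided p-value and the posterior probability of a sign reversal is easy to check numerically. With the noninformative prior, the posterior of δ is a t distribution with 9 df centered on d with scale e = s/√n, and its tail area below zero reproduces the p-value exactly (a sketch with scipy, reusing the summary values above):

```python
from math import sqrt
from scipy import stats

n, d, s = 10, 1.58, 1.23
e = s / sqrt(n)

# Frequentist one-sided p-value of the t test
p_one_sided = stats.t.sf(d / e, df=n - 1)

# Fiducial Bayesian posterior: delta ~ d + e * t_9
pr_negative = stats.t.cdf(0.0, df=n - 1, loc=d, scale=e)  # Pr(delta < 0)
pr_positive = 1.0 - pr_negative                           # Pr(delta > 0)
```

The two quantities agree to machine precision, which is the point of the reinterpretation.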

Interval estimates: Fiducial Bayesian interpretation of the usual CI.
Another important feature is the interpretation of the usual confidence interval in natural terms. In the Bayesian framework, this interval is usually termed a credibility interval or a credible interval, which explicitly accounts for the difference in interpretation. It becomes correct to say that "there is a 95% probability (or guarantee) of δ being included between the fixed bounds of the interval" (conditionally on the data), i.e. for Student's example between +0.70 and +2.46 hours.
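With scipy, the 95% interval quoted above is simply the central 95% of the posterior t distribution (a sketch, with the same summary values as before):

```python
from math import sqrt
from scipy import stats

n, d, s = 10, 1.58, 1.23
e = s / sqrt(n)

# Posterior of delta: generalized t with 9 df, center d, scale e
lo, hi = stats.t.interval(0.95, df=n - 1, loc=d, scale=e)
# lo and hi are about +0.70 and +2.46 hours
```

Numerically this coincides with the usual frequentist confidence interval; only the interpretation changes.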
Effect sizes: Straight Bayesian answers. Beyond the reinterpretations of the usual frequentist procedures, other Bayesian statements give straight answers to the question of effect sizes. We can compute the probability that δ exceeds a fixed, easier to interpret, additional time; for instance "there is a 91.5% probability of δ exceeding one hour". Since the units of measurement are meaningful, it is easy to assess the practical significance of the magnitude of δ. To summarize the results, it can be reported that "there is a 91.5% posterior probability of a large positive difference (δ > +1), an 8.4% probability of a positive but limited difference (0 < δ < +1), and a 0.14% probability of a negative difference". Such a statement has no frequentist counterpart.
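Each of these posterior statements is a one-liner once the posterior is set up (a sketch; the one-hour threshold for a "large" difference is the one used in the text):

```python
from math import sqrt
from scipy import stats

n, d, s = 10, 1.58, 1.23
posterior = stats.t(df=n - 1, loc=d, scale=s / sqrt(n))

pr_large = posterior.sf(1.0)                         # Pr(delta > +1), about 0.915
pr_limited = posterior.cdf(1.0) - posterior.cdf(0.0)  # Pr(0 < delta < +1)
pr_negative = posterior.cdf(0.0)                     # Pr(delta < 0), about 0.0014
```

The three probabilities partition the possible values of δ, so they sum to one.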
The question of replication of observations. Bayesian inference offers a direct and very intuitive solution. Given the performed experiment, the predictive distribution expresses our state of knowledge about future data. For instance, for an additional experimental unit, "there is an 87.4% probability of a positive difference and a 78.8% probability of a difference exceeding half an hour", and for a future sample of size 10, "there is a 99.1% probability of a positive difference and a 95.9% probability of a difference exceeding half an hour".
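The predictive probabilities quoted come from the same machinery, with the posterior scale inflated by the sampling uncertainty of the future sample (a sketch; n_future is a free choice):

```python
from math import sqrt
from scipy import stats

n, d, s = 10, 1.58, 1.23
e = s / sqrt(n)

def predictive(n_future):
    """Predictive distribution of the mean difference in a future
    sample of size n_future, given the observed data."""
    e_future = s / sqrt(n_future)
    return stats.t(df=n - 1, loc=d, scale=sqrt(e**2 + e_future**2))

pr_pos_1 = predictive(1).sf(0.0)     # about 0.874
pr_half_1 = predictive(1).sf(0.5)    # about 0.788
pr_pos_10 = predictive(10).sf(0.0)   # about 0.991
pr_half_10 = predictive(10).sf(0.5)  # about 0.959
```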
Power and sample size: Bayesian data planning and monitoring. "An essential aspect of the process of evaluating design strategies is the ability to calculate predictive probabilities of potential results" (Berry, 1991, p. 81). Bayesian predictive procedures give users a very appealing method to answer essential questions such as: "how big should the experiment be to have a reasonable chance of demonstrating a given conclusion?"; "given the current data, what is the chance that the final result will be in some sense conclusive, or on the contrary inconclusive?" These questions are unconditional in that they require consideration of all possible values of the parameters. Whereas traditional frequentist practice does not address these questions, predictive probabilities give them direct and natural answers.
In particular, from a pilot study, the predictive probabilities on credibility limits give a useful summary to help in the choice of the sample size of an experiment. If the data from the pilot study are included in the final analysis, final results for the whole data can be predicted as well (Lecoutre, 2001). Predictive procedures can also be used to aid the decision to abandon an experiment if the predictive probability appears poor. Some relevant references are Berry (1991), Lecoutre, Derzko and Grouin (1995), Joseph and Bélisle (1997), Dignam et al. (1998), Johns and Andersen (1999), Lecoutre (2001), and Lecoutre, Mabika and Derzko (2002).
Introducing "informative" priors. If the use of noninformative priors has a privileged status for obtaining "public use" statements, other Bayesian techniques also have an important role to play in experimental investigations. They are ideally suited for combining information from several studies and therefore for planning a series of experiments. Realistic uses of these techniques have been proposed. When a fiducial Bayesian analysis suggests a given conclusion, various prior distributions expressing results from other experiments or subjective opinions from specific, well-informed individuals ("experts"), whether skeptical or enthusiastic, can be investigated to assess the robustness of conclusions (see in particular Spiegelhalter, Freedman and Parmar, 1994). With regard to scientists' need for objectivity, it could be argued with Dickey (1986, p. 135) that "an objective scientific report is a report of the whole prior-to-posterior mapping of a relevant range of prior probability distributions, keyed to meaningful uncertainty interpretations".

The Feasibility of Bayesian Methods
We especially developed Bayesian methods in the analysis of variance framework, which is an issue of particular importance for experimental data analysis. Experimental investigations frequently involve complex designs, especially repeated-measures designs. Bayesian procedures have been developed on the subject, but they are generally thought difficult to implement and are not included in the commonly available computer packages. As a consequence the possibility of teaching them is still largely questionable for many statistics teachers.
A simple way to deal with the complexity of experimental designs is to use the specific analysis approach. Roughly speaking, a specific analysis for a particular effect consists in handling only the data that are relevant for it. Most often, the design structure of these relevant data is much simpler than the original design structure, and the number of "nuisance" parameters involved in the specific inference is drastically reduced. Consequently, in the Bayesian framework, relatively elementary procedures can be applied and realistic prior distributions can be investigated. Furthermore, the necessary and minimal assumptions specific to each particular inference are made explicit. When these assumptions are under suspicion, alternative procedures can easily be envisaged: for instance we can apply a transformation to the relevant data, or again use solutions that do not assume the equality of variances, etc. Thus, the advantages of the specific analysis approach over the conventional general model approach appear overwhelming, both for the feasibility and for the understanding of procedures.
Further justifications can be found in Rouanet and Lecoutre (1983) (see also Lecoutre, 1984 and Rouanet, 1996). Note that the interest of the specific analysis approach to analysis of variance is often implicitly recognized. In this way, Hand and Taylor (1987) suggested systematically deriving relevant data before using commonly available computer packages. In a more particular context, Jones and Kenward (1989) developed a "simple and robust analysis for two-group dual designs" (p. 160) which is typically a specific analysis.
Three decisive advantages of the specific analysis approach can be stressed. (1) All the traditional analysis of variance procedures can be derived as a direct extension of the basic procedures used in descriptive statistics (means, standard deviations) and inferential statistics (Student's t tests). (2) Complex designs involving several factors can easily be handled. (3) The exact validity assumptions for each inference can be made explicit and comprehensible.
Statistical computer programs based on the specific inference approach have been developed (Lecoutre and Poitevineau, 1992; Lecoutre, 1996; Lecoutre and Poitevineau, 2005). They incorporate both traditional frequentist practices (significance tests, confidence intervals) and Bayesian procedures (noninformative and conjugate priors). These procedures are applicable to general experimental designs (in particular, repeated measures designs), balanced or unbalanced, with univariate or multivariate data, and covariates.
Other packages designed for teaching or learning elementary Bayesian statistical inference are First Bayes (O'Hagan, 1996) and a package of Minitab macros (Albert, 1996).
I have restricted my presentation here to the analysis of variance framework; however similar materials are also available for inferences about proportions (Lecoutre, Derzko and Grouin, 1995; Bernard, 2000; Lecoutre and Charron, 2000).

Training Students and Researchers in Bayesian Methods
"It is their straightforward, natural approach to inference that makes them [Bayesian methods] so attractive" (Schmitt, 1969, preface).

In 1976 Jaynes wrote: "As a teacher, I therefore feel that to continue the time honoured practice - still in effect in many schools - of teaching pure orthodox statistics to students, with only a passing sneer at Bayes and Laplace, is to perpetuate a tragic error which has already wasted thousands of man-years of our finest mathematical talent in pursuit of false goals. If this talent had been directed toward understanding Laplace's contributions and learning how to use them properly, statistical practice would be far more advanced than it is" (Jaynes, 1976, p. 256). It would be folly to perpetuate this error! For more than twenty-five years now, my colleagues and I have gradually introduced Bayesian methods in courses and seminars for audiences of various backgrounds, especially in psychology. Our statistical teaching and consulting experience has shown us that these methods are far more intuitive and much closer to the thinking of scientists than frequentist procedures. So we completely disagree with Moore (1997), who claimed that "Bayesian reasoning is considerably more difficult to assimilate than the reasoning of standard inference".

Teaching strategy
Since experimental publications are full of significance tests, students and researchers are (and will continue to be) constantly confronted with their use. NHST is such an integral part of scientists' behavior and of experimental teaching that its misuses and abuses cannot be discontinued by flinging it out of the window, even if I completely agree with Rozeboom (1997, p. 335) that NHST is "surely the most bone-headedly misguided procedure ever institutionalised in the rote training of science students". This reality cannot be ignored, and it is a challenge for teachers of statistics to introduce Bayesian inference without discarding either NHST or the "official" guidelines that tend to supplant it with confidence intervals. So I argue that the sole effective strategy is a smooth transition towards the Bayesian paradigm (see Lecoutre, Lecoutre and Poitevineau, 2001).
The suggested teaching strategy is to introduce Bayesian methods as follows.
(1) To present natural fiducial Bayesian interpretations of NHST outcomes to call attention to their shortcomings. (2) To create as a result of this the need for a change of emphasis in the presentation and interpretation of results.
(3) Finally to equip students with a real possibility of thinking sensibly about statistical inference problems and behaving in a more reasonable manner.
With an interactive use of our computer programs, a very limited set of preliminary notions is needed to introduce basic ANOVA procedures, that is, inferences about one degree of freedom effects in complex designs. The possibility of applying Bayesian methods in the context of realistic complex experimental designs is an essential requirement for motivating students and researchers. Attention can be concentrated on the basic principles and the practical meaning of the procedures. As a consequence, the principles of advanced techniques can be more easily understood, independently of their mathematical difficulty.

First example: student data
It is remarkable that Student's example presented in Section 3.3 was a typical application of the specific analysis approach. The basic data were, for each of the n = 10 patients, the difference between the two "additional hour's sleep gained by the use of hyoscyamine hydrobromide [a hypnotic]", the hour's sleep being measured without drug and after treatment with either (1) "dextro hyoscyamine hydrobromide" or (2) "laevo hyoscyamine hydrobromide" (note that these were already derived data). Student's analysis is a typical example of specific inference: it involves only the elementary inference about a normal mean.
In the same way, we can apply to the data in Table 1 the elementary Bayesian inference about a normal mean, with only two parameters, the population mean difference δ and the standard deviation σ. Assuming the usual noninformative prior, the posterior (fiducial Bayesian) distribution of δ is a generalized (or scaled) t distribution. It is centered on the observed mean difference d = +1.58 and has a scale factor e = s/√n = 0.39. The distribution has the same degrees of freedom q = 9 as the t test. This is written δ ∼ d + e·t_q, or again δ ∼ t_q(d, e²), hence here δ ∼ t_9(+1.58, 0.39²), by analogy with the normal distribution (note that this distribution must not be confused with the noncentral t distribution, familiar to power analysts). The scale factor e is the denominator of the usual t test statistic, that is e = d/t (assuming d ≠ 0). In consequence, the fiducial Bayesian distribution of δ can be directly derived from t = +4.06. This result brings to the fore the fundamental property of the t test statistic of being an estimate of the experimental accuracy, conditionally on the observed value d. More precisely, (d/t)² estimates the sampling error variance of d.
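This posterior can be built with scipy directly from the published t statistic, illustrating the e = d/t relation (a sketch; only d, t and the degrees of freedom are taken from the text):

```python
from scipy import stats

d, t_stat, q = 1.58, 4.06, 9  # observed mean, t statistic, degrees of freedom

# Scale factor recovered from the t statistic: (d/t)^2 estimates
# the sampling error variance of d
e = d / t_stat                # about 0.39

# delta ~ t_9(+1.58, 0.39^2): generalized t, center d, scale e
posterior = stats.t(df=q, loc=d, scale=e)

median = posterior.median()   # equals d: the posterior is centered on d
```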
Resorting to computers solves the technical problems involved in the use of Bayesian distributions. This gives the students an attractive and intuitive way of understanding the impact of sample sizes, data and prior distributions. The posterior distribution can be investigated by means of visual displays. The fiducial Bayesian interpretation of usual significance tests is made explicit. The credibility limits for a given probability (or guarantee), or conversely the probability of a given interval, can be computed.
An important aspect of statistical inference is making predictions. Again, the Bayesian inference offers a direct and very intuitive solution. For instance, what can be said about the value of the difference d′ that would be observed for new data? The predictive distribution for d′ in a future sample of size n′ is naturally more scattered than the distribution of δ relative to the population (and all the more so when the new sample is smaller). Thus the fiducial Bayesian (posterior) predictive distribution for d′, given the value d observed in the available data, is again a generalized t distribution (naturally centered on d), d′ ~ t_q(d, e² + e′²), where e′ = s/√n′. In fact, the uncertainty about δ given the available data (reflected by e²) is added to the uncertainty about the results of the future sample when δ is known (reflected by e′²). Given Student's data, the predictive distribution is d′ ~ t_9(+1.58, 1.29²) for a future experimental unit (n′ = 1) and d′ ~ t_9(+1.58, 0.55²) for a replication with the same sample size (e′ = e).
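The predictive computations can be sketched in the same way (again assuming Python with SciPy rather than our own programs); the predictive scale √(e² + e′²) reproduces the two values quoted above:

```python
# Predictive distribution for the mean difference d' in a future sample of
# size n' (Student data: d = +1.58, e = 0.39, n = 10, q = 9 df).
import math
from scipy import stats

d, n, q = 1.58, 10, 9
e = 0.39                  # e = s / sqrt(n)
s = e * math.sqrt(n)      # back out the observed standard deviation

for n_future in (1, 10):
    e_future = s / math.sqrt(n_future)
    scale = math.sqrt(e**2 + e_future**2)   # predictive scale combines both uncertainties
    predictive = stats.t(df=q, loc=d, scale=scale)
    lo, hi = predictive.interval(0.95)
    print(f"n' = {n_future:2d}: d' ~ t_9({d:+.2f}, {scale:.2f}^2), "
          f"95% predictive interval [{lo:.2f}, {hi:.2f}]")
```

Varying n_future interactively shows students how the predictive uncertainty shrinks toward the posterior uncertainty as the future sample grows.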

Second example: reaction time experiment
As an illustration of a more complex design, let us consider the following example, derived from Holender and Bertelson (1975). In a psychological experiment, the subject must react to a signal. The experimental design involves two crossed repeated factors: Factor A (signal frequency) with two levels (a1: frequent and a2: rare), and Factor B (foreperiod duration) with two levels (b1: short and b2: long). The main research hypothesis is a null (or about null) interaction effect between factors A and B (additive model). The n = 12 subjects are divided into three groups of four subjects each. The data treated here and reported in Table 2 are reaction times in ms (averaged over trials). They have been previously analysed in detail with Bayesian methods in Rouanet and Lecoutre (1983), Rouanet (1996) and Lecoutre and Derzko (2001). I will focus here on the technical aspects of the specific analysis approach for one degree of freedom sources of variation, but this approach can be easily generalized to sources with several df.
Here the basic data consist of three "groups" and four "occasions" of measure. Since A and B are both two-level factors, their interaction can be represented by a single contrast among the four occasions. Let us consider the contrast with coefficients [w_o] = [+1 −1 −1 +1], called the coefficients of derivation upon occasions. The relevant derived data for interaction consist of the twelve individual interaction effects reported in Table 2. They constitute a simple (balanced) one-way layout, and the interaction effect amounts to the overall mean δ. This mean is given by the coefficients of derivation upon groups [v_g] = [1/3 1/3 1/3]. As a general result, a one-df effect can be tested from the t statistic t = d/e = −0.217, where e = b·s = 9.61 is precisely the scale factor of the fiducial Bayesian distribution. The constant b depends on the coefficients of derivation upon groups v_g and on the group sizes f_g (here b² = Σ v_g²/f_g = 1/12). The within-group variance s² = 33.28² is the mean of the group variances weighted by their degrees of freedom f_g − 1. In the case of unequal group sizes we could consider either the unweighted mean or the weighted mean of the group variances. The following general results ensure the link with the traditional ANOVA procedures. The two mean squares of the usual ANOVA F ratio are respectively proportional to d² and s²: MS_A.B = (d/(ab))² = 13.02 and MS_S(G).A.B = (s/a)² = 276.84. The constant a only depends on the coefficients of derivation upon occasions w_o: a² = Σ w_o² = 4. All these formulae are made explicit in our computer programs. With these notations, all inferential (frequentist and Bayesian) procedures are simply modeled on the inference on a normal mean.
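As a numerical check (a sketch only; the coefficient vectors below are the standard ones for a 2×2 interaction and a three-group overall mean, consistent with the a² = 4 and e = 9.61 reported above), the links between the specific analysis and the ANOVA mean squares can be verified:

```python
# Verify the algebraic links for the A.B interaction: e = b*s,
# MS_A.B = (d/(ab))^2 and MS_S(G).A.B = (s/a)^2, using values from the text.
import math

w_o = [1, -1, -1, 1]            # coefficients of derivation upon occasions (2x2 interaction)
v_g = [1/3, 1/3, 1/3]           # coefficients of derivation upon groups (overall mean)
f_g = [4, 4, 4]                 # group sizes

a = math.sqrt(sum(w**2 for w in w_o))                    # a^2 = sum of w_o^2 = 4
b = math.sqrt(sum(v**2 / f for v, f in zip(v_g, f_g)))   # b^2 = sum of v_g^2 / f_g

s = 33.28                       # within-group standard deviation (from the text)
t = -0.217                      # t statistic (from the text)
e = b * s                       # scale factor of the fiducial Bayesian distribution
d = e * t                       # observed interaction effect, backed out from t = d/e

ms_interaction = (d / (a * b))**2    # MS_A.B
ms_error = (s / a)**2                # MS_S(G).A.B
print(f"a = {a}, b = {b:.4f}, e = {e:.2f}")
print(f"MS_A.B = {ms_interaction:.2f}, MS_S(G).A.B = {ms_error:.2f}")
```

The computed values match the mean squares reported in the text, up to rounding of the quoted inputs.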
Any one-df source of variation of interest can be analyzed in the same way. Suppose for instance that group g3 is a control group; then we may plan to decompose some effects involving factor G according to the following two contrasts: g2 − g1 (opposing g2 and g1) and g3 − (g1 + g2)/2 (opposing g3 on the one hand and g1 and g2 on the other hand). The specific analysis of these two contrasts involves as relevant data the twelve individual means reported in Table 2. The coefficients of derivation upon occasions are [w_o] = [1/4 1/4 1/4 1/4] (a² = 1/4), and we consider for the derived data the two (orthogonal) contrasts between groups, with the respective coefficients given by the codings in Table 3. From the relevant data for interaction, we can again analyze the interactions between A.B and these two contrasts. Table 3 gives a summary of the specific analyses of all sources of variation. Codings for the weights: a = +1/4, b = 1/2, c = −1/2, d = 1/3, e = −1, f = 1, g = 0.

A Challenge for Statistical Instructors
Training students and researchers in Bayesian methods should become an attractive challenge for statistical instructors. It is often claimed that Bayesian methods require new probabilistic concepts, in particular the Bayesian definition of probability, conditional probabilities and Bayes' formula. However, since most people use "inverse probability" statements to interpret NHST and confidence intervals, these notions are already, at least implicitly, involved in frequentist methods. All that is required for teaching the Bayesian approach is a very natural shift of emphasis regarding these concepts, showing that they can be used consistently and appropriately in statistical analysis.

A natural change of emphasis about probabilistic concepts
"[Bayesian analysis provides] direct probability statements -which are what most people wrongly assume they are getting from conventional statistics" (Grunkemeier and Payne, 2002, p. 1901). A recent empirical study (Albert, 2003) indicates that students in introductory statistics classes are generally confused about the different notions of probability. Clearly, teaching NHST and confidence intervals can only add to the confusion, since these methods are justified by frequentist arguments and generally (mis)interpreted in Bayesian terms. Ironically, these heretical interpretations are encouraged by the duplicity of most statistical instructors, who tolerate and even use them. For instance, Pagano (1990, p. 288) describes a 95% confidence interval as "an interval such that the probability is 0.95 that the interval contains the population value". Other authors claim that the "correct" frequentist interpretation they advocate can be expressed as "we can be 95% confident that the population mean is between 114.06 and 119.94" (Kirk, 1982, p. 43), "95% confident that θ is below B(X)" (Steiger and Fouladi, 1997, p. 230) or "we may claim 95% confidence that the population value of multiple R² is no lower than .0266" (Smithson, 2001, p. 614). It is hard to imagine that students or scientists can understand that "confident" refers here to a frequentist view of probability! Indeed, in a recent paper, Schweder and Hjort (2002) gave the following revealing definitions of probability: "we will distinguish between probability as frequency, termed probability, and probability as information/uncertainty, termed confidence" (italics added). After many attempts to teach the "correct" interpretation of frequentist procedures, I completely agree with Freeman (1993) that in these attempts "we are fighting a losing battle".
Regarding conditional probability and Bayes' formula, the traditional teaching of frequentist procedures is also misleading. This is especially revealed by the fact that even experienced researchers frequently confuse "the [conditional] probability of making a Type I error if the null hypothesis is true" with "the marginal probability of making a Type I error". Thus Azar (1999) wrote: "[a significant result] indicates that the chances of the finding being random is only 5 percent or less"; this statement was later commented on by Bakeman (1999) as "a misunderstanding that generations of instructors of statistics clearly have failed to eradicate". This can be due to the fact that little or no emphasis is placed on conditional probabilities in most frequentist presentations. For instance, standard statistical textbooks speak about "the probability of making a Type I [Type II] error", omitting the conditional argument "given H0 [H1]" (see e.g. Kirk, 1982, pp. 36-37). I believe with Berry (1997) that conditional probabilities are intuitive for many people. Also, Bayes' formula is easily understood if it is introduced from contingency tables, with probabilities interpreted as frequencies so that prior probabilities can be supposed exactly known (see Box and Tiao, 1973, p. 12).
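For instance, Bayes' formula can be read directly off a contingency table of counts (the counts below are hypothetical, chosen only to make the α = .05 conditional probability explicit):

```python
# Bayes' formula introduced from a contingency table with frequencies.
# Rows: H0 true / H1 true; columns: test significant / not significant.
counts = {
    ("H0", "significant"): 25,        # Type I errors
    ("H0", "not significant"): 475,
    ("H1", "significant"): 400,
    ("H1", "not significant"): 100,
}

# P(significant | H0): the conditional probability controlled by the alpha level
p_sig_given_h0 = counts[("H0", "significant")] / (
    counts[("H0", "significant")] + counts[("H0", "not significant")]
)

# Bayes' formula, read directly off the table:
# P(H0 | significant) = count(H0 and significant) / count(significant)
n_sig = counts[("H0", "significant")] + counts[("H1", "significant")]
p_h0_given_sig = counts[("H0", "significant")] / n_sig

print(f"P(significant | H0) = {p_sig_given_h0:.3f}")   # 0.05 by construction
print(f"P(H0 | significant) = {p_h0_given_sig:.3f}")   # a different conditional probability
```

Seeing the two conditional probabilities computed side by side makes the distinction between them, and hence the error in Azar's statement, immediately visible.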
Considerable difficulties are due to the mysterious and unrealistic use of the sampling distribution for justifying NHST and confidence intervals. Frequent questions asked by students show us that this use is counterintuitive: "why must one calculate the probability of samples that have not been observed?"; "why does one consider the probability of sample outcomes more extreme than the one observed?"; etc. Such difficulties are not encountered with the Bayesian inference: the posterior distribution, being conditional on the data, only involves the sampling probability of the data in hand, via the likelihood function, which reads the sampling distribution in the "natural order".

The Bayesian approach gives tools to overcome usual difficulties
"I stopped teaching frequentist methods when I decided that they could not be learned" (Berry, 1997).
There are hardly any intuitive justifications of frequentist procedures that are not, in fact, Bayesian. On the contrary, with the Bayesian approach, intuitive justifications and interpretations of procedures can be given, so that the level of mathematical justification can be easily adapted to the students' state of knowledge. So it can be argued with Albert (1995, 1997) and Berry (1997) that elementary Bayesian inference can be taught effectively to undergraduate students and that students benefit greatly from such instruction. Moreover, an empirical understanding of probability concepts is gained by applying Bayesian procedures, especially with the help of computer programs.
Our experience with Bayesian methods is that they allow students to overcome the usual difficulties encountered with the frequentist approach. Of course, the following list is not exhaustive, and empirical studies to support our conclusions would be welcome. It can be hard for students to distinguish a parameter, such as a population mean, from the observed mean statistic computed from a sample. The two notions of posterior distribution and predictive distribution of future data, given available data, are useful tools to give students an understanding of this essential distinction. Moreover, the predictive distribution can be used to give, as limiting cases: (1) the sampling distribution of a statistic, when the prior distribution tends to a point distribution ("known parameter"); (2) the posterior distribution, when the sample size of the future data tends to infinity (the parameter can be seen as the observed statistic in a future sample of very large size).
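The second limiting case can be illustrated numerically (a sketch, using approximate values from Student's data): as n′ grows, the predictive scale √(e² + s²/n′) tends to the posterior scale e:

```python
# As the size n' of the future sample grows, the predictive scale tends to the
# posterior scale e, i.e. the predictive distribution tends to the posterior.
import math

s, n = 1.23, 10           # Student data (approximate values)
e = s / math.sqrt(n)      # posterior scale factor

for n_future in (1, 10, 100, 10_000):
    predictive_scale = math.sqrt(e**2 + s**2 / n_future)
    print(f"n' = {n_future:6d}: predictive scale = {predictive_scale:.4f} "
          f"(posterior scale e = {e:.4f})")
```

Students see on screen that the predictive scale is always larger than e, but converges to it, which makes the "parameter as statistic in a very large future sample" interpretation concrete.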
Moreover, the notions of posterior and predictive distributions, being fundamental tools for a better understanding of sampling fluctuations, allow students to become aware of misconceptions about the replication of experiments. Indeed, many people overestimate the probability of repeating a significant result (Tversky and Kahneman, 1971; Lecoutre and Rouanet, 1993). Similar misconceptions are encountered with confidence intervals. An empirical study (Cumming et al., 2004) suggested that many "leading researchers" in psychology, behavioural neuroscience, and medicine "hold the confidence level misconception that a 95% CI will on average capture 95% of replication means", underestimating the extent to which future replications will vary.
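The confidence-level misconception can be made concrete by a small simulation (a sketch, with a hypothetical normal population and σ treated as known for simplicity): the long-run proportion of replication means captured by an original 95% CI is markedly below 95%:

```python
# Simulate pairs of experiments: an original 95% CI for the mean, then a
# replication; count how often the replication mean falls inside the CI.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 20
n_pairs = 100_000

# Original experiments: 95% CI for the mean (sigma known here, for simplicity)
means_1 = rng.normal(mu, sigma / np.sqrt(n), n_pairs)
half_width = stats.norm.ppf(0.975) * sigma / np.sqrt(n)

# Replication experiments with the same design
means_2 = rng.normal(mu, sigma / np.sqrt(n), n_pairs)

captured = np.abs(means_2 - means_1) <= half_width
print(f"Fraction of replication means inside the original 95% CI: {captured.mean():.3f}")
```

The simulated fraction is close to 83%, not 95%, because both the original interval and the replication mean fluctuate; this is exactly the point that the predictive distribution makes analytically.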
An important difficulty with the logic of NHST is that it requires the hypothesis to be demonstrated to be the alternative hypothesis. Of course, this artifice can be completely avoided with the Bayesian approach, which provides direct answers to the right questions: "what is the probability that the difference between two means is large?"; "what is the probability that the difference (in absolute value) is small?"; "given the current inconclusive partial data, what is the chance that the final result will be conclusive?"; etc.
Using the fiducial Bayesian interpretations of significance tests and confidence intervals, in the natural language of probabilities about unknown effects, comes quite naturally to students. In return, the common misuses and abuses of NHST appear to be more clearly understood. In particular, students quickly become alert to the fact that nonsignificant results cannot be interpreted as "proof of no effect". I completely agree with Berry (1997), who ironically concludes that students exposed only to a Bayesian approach "come to understand the frequentist concepts of confidence intervals and P values better than do students exposed only to a frequentist approach".
An important objective of statistical teaching is to prepare students to read experimental publications. For the reasons set out above, with the Bayesian approach students are well equipped for an intelligent and critical reading. In fact, the Bayesian approach fits in better than the frequentist approach with the usual way of reporting experimental results, which seldom explicitly involves the basic concepts of the NHST reasoning (null hypothesis, α level, etc.).
By interactively investigating various prior distributions and contrasting the resulting posteriors with the fiducial Bayesian solution, students can gain understanding and intuition about the relative roles of sample size, data and external information. Investigating predictive distributions by varying the respective sample sizes of the available and future data is also useful to give students an intuitive understanding of the role of sample size.
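Such an investigation can be sketched as follows (assuming, for simplicity, a conjugate normal model with σ taken as known; the values approximate Student's data and are illustrative only):

```python
# Contrast a nearly flat prior with a "skeptical" prior centered on zero
# effect, using the standard conjugate normal update with known sigma.
import math

d, s, n = 1.58, 1.23, 10     # observed mean difference, sigma assumed known = s, sample size
sampling_var = s**2 / n

def posterior(prior_mean, prior_var):
    """Precision-weighted combination of prior and data (conjugate normal update)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / sampling_var)
    post_mean = post_var * (prior_mean / prior_var + d / sampling_var)
    return post_mean, math.sqrt(post_var)

for label, prior_var in [("nearly flat", 1e6), ("skeptical", 0.25)]:
    m, sd = posterior(0.0, prior_var)
    print(f"{label:11s} prior: posterior delta ~ N({m:.2f}, {sd:.2f}^2)")
```

With the nearly flat prior the posterior is centered on the observed d = 1.58; the skeptical prior pulls the posterior mean toward zero and narrows it, showing students exactly how much evidence the data must supply to counterbalance an external opinion.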

Some possible difficulties with the Bayesian approach
The most often denounced difficulties with the Bayesian approach lie in the elicitation of the prior distribution. Berry (1997) places emphasis on the fact that prior and posterior Bayesian distributions are subjective, and forces students to assess their prior probabilities, while recognizing the difficulty of this task ("they don't like it"). At the least, the role of subjective probability should be clarified (D'Agostini, 1999).
However, insofar as experimental data analysis is concerned, I do not think that it is a good strategy to draw the attention of students (or researchers) to an approach that does not answer their expectations (see Section 2.1). So we always avoid, at least in a first stage, the issue of assessing a "subjective" prior distribution, and focus our teaching on the fiducial Bayesian procedures. Once students have become familiar with their use and interpretation, there are appealing ways to introduce "informative" prior distributions at a later stage. In particular, students generally find it attractive to investigate the impact of a handicap ("skeptical") prior and to examine whether the data give sufficient evidence to counterbalance it. Priors that express the results of previous experiments are also generally well accepted. Finally, one can show that the elicitation of prior opinions from "experts" in the field can be useful in some studies, but it must be emphasized that this requires appropriate techniques (see Tan et al., 2003, for an example in clinical trials).
Other difficulties can be due to confusions with the frequentist interpretations. For instance, some students erroneously conclude from the posterior distribution that the observed difference, not the population difference, is large, which can be due to a confusion with the NHST reasoning (a result is significant if the observed difference is "in a sense" large). One possibility is not to teach frequentist methods at all (Berry, 1997). However, in the current context, this is hardly a realistic attitude. An alternative line of attack is to use the combinatorial (or set-theoretic) inference approach suggested by Rouanet and Bert (2000) (see also Rouanet, Bernard and Lecoutre, 1986; Rouanet, Bernard and Le Roux, 1990). Roughly speaking, this approach consists of ruling out the "randomness" character of the concept of sample and replacing probabilistic formulations by formulations in terms of "proportions of samples". The teaching motivation is to allow students to learn the computational aspects of frequentist inference procedures without being prematurely concerned with the conceptual difficulties of probabilistic concepts. Thus the probabilistic formulations, and in particular the interpretation of frequentist procedures, are reserved for the Bayesian approach, minimizing possible sources of confusion.

Conclusion
"It could be argued that since most physicians use statement A [the probability the true mean value is in the interval is 95%] to describe 'confidence' intervals, what they really want are 'probability' intervals. Since to get them they must use Bayesian methods, then they are really Bayesians at heart!" (Grunkemeier and Payne, 2002, p. 1904). Nowadays Bayesian routine methods for the familiar situations of experimental data analysis are easy to implement. They fulfill the requirements of scientists, and they fit in better with their spontaneous interpretations of data than frequentist procedures do. So they can be taught to non-statistician students and researchers in an intuitively appealing form. Using the fiducial Bayesian (i.e. based on noninformative priors) interpretations of significance tests and confidence intervals, in the natural language of probabilities about unknown effects, comes quite spontaneously to students. In return, the Bayesian approach bypasses the usual difficulties encountered with frequentist procedures; in particular, the common misuses and abuses of NHST are more clearly understood. Users' attention can be focused on more appropriate strategies, such as consideration of the practical significance of results and the replication of experiments.

Table 1:
Student's data: additional hours of sleep gained under the two treatments, and their differences, for the n = 10 patients.

Table 2:
Reaction time experiment: basic data and relevant data for interaction and for group comparisons.

Table 3:
Reaction time experiment: summary table of specific analyses.