A Two-Stage Bayesian Model for Predicting Winners in Major League Baseball

The probability of winning a game in major league baseball depends on various factors relating to team strength including the past per- formance of the two teams, the batting ability of the two teams and the starting pitchers. These three factors change over time. We combine these factors by adopting contribution parameters, and include a home field ad- vantage variable in forming a two-stage Bayesian model. A Markov chain Monte Carlo algorithm is used to carry out Bayesian inference and to sim- ulate outcomes of future games. We apply the approach to data obtained from the 2001 regular season in major league baseball. The probability of winning a game in major league baseball (MLB) depends on various factors relating to team strength including the past performance of the two teams, the batting ability of the two teams and the starting pitchers. We define the relative strength of a team over a competing team at a given point in time by combining these three factors into a single measurement. Although other factors relating to team strength may influence the probability of winning, we as- sume that their effect is minor, and note that the three main factors mentioned above are those that are traditionally considered in the setting of betting odds by bookmakers (McCune 1989). Bookmakers are also aware that the home field advantage significantly influences the probability of winning, and likewise, we in- clude it in our calculations. In this paper, we propose a two-stage Bayesian model based on the relative strength variable and the home field advantage variable to predict the outcomes of games in MLB. MLB in the United States is divided into two leagues and six divisions. The American League (AL) has three divisions and the National League (NL) has three divisions. Each team plays 162 games in the regular season (April through


Introduction
The probability of winning a game in major league baseball (MLB) depends on various factors relating to team strength including the past performance of the two teams, the batting ability of the two teams and the starting pitchers.We define the relative strength of a team over a competing team at a given point in time by combining these three factors into a single measurement.Although other factors relating to team strength may influence the probability of winning, we assume that their effect is minor, and note that the three main factors mentioned above are those that are traditionally considered in the setting of betting odds by bookmakers (McCune 1989).Bookmakers are also aware that the home field advantage significantly influences the probability of winning, and likewise, we include it in our calculations.In this paper, we propose a two-stage Bayesian model based on the relative strength variable and the home field advantage variable to predict the outcomes of games in MLB.
MLB in the United States is divided into two leagues and six divisions.The American League (AL) has three divisions and the National League (NL) has three divisions.Each team plays 162 games in the regular season (April through early October) which does not include the pre-season and the post-season playoffs.Summary statistics for the 2001 MLB regular season appear in Table 1.Overall win percentages are followed by the home and away win percentages, the overall team batting average and the overall team earned run average (ERA).We observe that most teams win more often at home than on the road, and this suggests that the home field advantage variable may be useful.The 2001 regular season is noteworthy in that Seattle finished with an outstanding 116-46 won-loss record, tying the 1906 Chicago Cubs' MLB record for wins in a season.Seattle also established the AL record for road wins in a season (59) and set a major league record with 29 consecutive road series won or tied.More detailed statistical information on MLB is available from numerous internet websites including www.sportline.com and www.sportsillustrated.com.
The relative strength of a team over a competing team at a given point in time is based on three ratios: (1) the ratio of the winning percentages between the two teams, (2) the ratio of the overall team batting averages and (3) the ratio of the ERAs between the two starting pitchers.We note that on the day of a MLB game, all three ratios are typically available.Later we demonstrate that ratio (3) is more important in determining the probability of a win than ratio (1) and ratio (2).The importance of ratio (3) is consistent with our intuition where we believe, for example, that in the 2001 season, the Arizona Diamondbacks have a higher probability of winning when either Randy Johnson or Curt Schilling is pitching.Clearly, it is unreasonable to expect that all three ratios affect the probability of winning in the same way.For this reason, we adopt three unknown contribution parameters that assign relative importance to each of the three ratios.
The contribution parameters are assumed constant across all teams and arise from independent prior densities.The home field advantage parameter is assumed to be apriori independent of the contribution parameters.Therefore our model essentially relies on four primary parameters; the three contribution parameters and the home field advantage parameter.This parametrization is much simpler than many other approaches.For example, the Bradley and Terry (1952) model requires at least n − 1 parameters where a team parameter is defined for each of the n teams and the team parameters sum to a constant.In addition, our relative strength variable is flexible as it readily accommodates other factors, and can be easily modified for various sports such as basketball, football, soccer and hockey.
A two-stage Bayesian model based on the relative strength variable and the home field advantage variable is proposed to predict the outcome of games in MLB.In the first stage, we assume that the probability that a given team wins is a random sample from a beta distribution with parameters based on the relative strength variable and the home field advantage variable.In the second stage, the outcome (win or loss) is a random sample from a Bernoulli distribution with this winning probability.We contrast the two-stage model with a one-stage model where the outcome (win or loss) is a random sample from a Bernoulli(p) distribution whose probability p is a direct function of the relative strength variable and the home field advantage variable.Thus, the two-stage model offers an extra level of variation to account for the difficulty that investigators have encountered when modelling outcomes of MLB baseball games (Kaigh 1995 andMcCune 1989).In addition, the two-stage Bayesian structure permits convenient Bayesian inference.
We observe that the posterior distribution arising from the two-stage Bayesian model is of a form that is not readily interpretable.In this context, Markov chain Monte Carlo (MCMC) algorithms sample variates according to a Markov chain whose stationary distribution is the desired posterior distribution.The goal is to average the sampled variates to estimate posterior quantities of interest.We consider a "Metropolis within Gibbs" MCMC algorithm (see Gilks, Richardson and Spiegelhalter 1996) which is useful when the full conditional densities of the Gibbs sampling algorithm (Geman and Geman 1984) are non-standard.The Gibbs sampling algorithm is perhaps the simplest of the MCMC algorithms.Additional reading on Gibbs sampling in related contexts can be found in Gelfand and Smith (1990), Barry and Hartigan (1993), and Kuo and Yang (1995, 1996, 2000).MCMC sampling is used to simulate the outcomes of future games and then predict the eventual division winners.Kaigh (1995) considers a simple method of prediction for major league baseball using only the home and away records of the competing teams.Predictions are compared against results from the 1989-1993 MLB regular seasons.It is not evident that the simple predictive model yields a profitable betting strategy.Barry and Hartigan (1993) consider a more complex choice model for prediction in MLB.Like our model, the Barry and Hartigan model is Bayesian, allows for changing team strengths over time and relies on a MCMC implementation.Some distinguishing features of their model include the assumption that team strengths are discrete, the exclusion of team batting averages as covariates and the huge number of parameters required.Perhaps the greatest practical difference, however, is that their model does not consider the effect of starting pitchers.Since starting pitchers may critically affect the outcome of a game, the Barry and Hartigan (1993) model cannot be used for accurate prediction of individual games.Their model is appropriate only for predicting a large number of games (e.g. the remainder of a season) where the effect of individual pitchers is averaged over the large number of games.Additional reading on baseball and prediction can be found in Bennett (1998), James, Albert andStern (1997), and Barry and Hartigan (1994).
Section 2 provides a complete specification of the full two-stage Bayesian model and indicates sub-models that may be appropriate.In Section 3, we adopt a Markov chain Monte Carlo approach for Bayesian inference.The full conditional distributions are derived and sampling is carried out using the Metropolis within Gibbs algorithm.Section 4 looks at the problem of prediction with specific emphasis on the prediction of results for the remainder of the season.In Section 5, simulation results from the models are studied in comparison with the actual final results from the 2001 MLB regular season.Some concluding remarks are then provided in Section 6.

The Model
Suppose that the T games in a MLB regular season are indexed in time according to s = 1, . . ., T .
We calculate covariates (α s , β s , γ s ) immediately prior to time s where • α s denotes the ratio of the winning percentages between the two teams • β s denotes the ratio of the overall team batting averages • γ s denotes the ratio of the starting pitchers' ERAs Without loss of generality, we let the numerator in the first two ratios correspond to the home team, and we let the denominator in the third ratio correspond to the home team.
As pointed out by a referee, there is variability in the covariates and the variability is decreasing in time s.In particular, early in the season, some pitchers may have extraordinarily high or low ERAs.For this reason, we truncate γ s according to 1/γ 0 ≤ γ s ≤ γ 0 for prescribed γ 0 .For the MLB data considered in this paper, we let γ 0 = 5.From a practical point of view, we consider only games corresponding to s ≥ t 0 for some moderate value t 0 , to allow the covariates to stabilize.
Typically, with MLB data, one would not expect extreme variation in the covariates α s and β s for It is unreasonable to expect that all three ratios affect the probability of winning in the same way.We therefore adopt unknown contribution parameters r = (r 1 , r 2 , r 3 ) where we define the relative strength of the home team over the visiting team at time s as We assume that the contribution parameters are constant across all MLB teams and arise from independent prior distributions with r i ∼ uniform(0, a i ) where a i is prescribed, i = 1, 2, 3.The value r i close to 0 implies that the corresponding ratio has little effect on the relative strength.For the MLB data considered in this paper, we let a i = 2 for i = 1, 2, 3. We view these as subjective priors where we use our knowledge of MLB to impose effects based on the current winning percentage, batting and pitching.Note that in the context of the models described below, a i = 2 is a realistic upper bound for the value of r i , i = 1, 2, 3.
The relative strength of the visiting team over the home team at time s is therefore given by the than 1, this implies that it is the stronger (weaker) team at time s.We note that the relative strength variable λ s is flexible as it readily accommodates other factors, and can be easily modified for various sports.
We also consider a home field advantage variable δ which is assumed constant across all MLB teams and is assumed independent of the contribution parameters.The visiting field effect is then defined as 1/δ where we interpret δ > 1 as a home field advantage.We assign a uniform(δ 0 , δ 1 ) prior distribution for δ where δ 0 and δ 1 are prescribed.For the MLB data considered in this paper, we let δ 0 = 0 and δ 1 = 2.The apriori belief E(δ) = 1 therefore implies that the home field has no effect.
Having defined the relative strength λ s and the home field advantage δ, we now propose three predictive models in increasing levels of complexity.Let the random variable X s equal 1 (0) if the home team wins (loses) in the sth game.We are interested in the probability distribution of X s .
For convenience, we define p s = Prob(X s = 1).
Model 1: Prob(X s ) = p Xs s (1 − p s ) 1−Xs and p s = Model 1 is a simple one-stage model where X s is a Bernoulli random variable with expected value p s .One of the features of the parametrization is that the probability of the visiting team winning is given by Model 2 is a two-stage model where the first stage uses the same Bernoulli distribution as in Model 1 and the second stage imposes a beta prior distribution.Note that conditional on p s , the expected value of X s is again p s .Model 2 offers an additional level of variation beyond Model 1 to account for the difficulty that investigators have encountered when modelling the outcomes of MLB games.

Bayesian Inference
Consider inference for the primary parameters r, δ, m and the p s in the full two-stage Bayesian model (Model 3).We refer to the fixed parameters a 1 , a 2 , a 3 , δ 0 , δ 1 and m 0 as hyperparameters of the model.
In the 2001 MLB regular season, each of the 30 teams is scheduled to play 162 games leading to a total of T = 15(162) = 2430 games.Again, the games are indexed in time according to s = 1, . . ., t, . . ., T where t is the current time.We denote the covariate triple D s = (α s , β s , γ s ) for s = t 0 , . . ., t − 1 where the components of D s are the three ratios defined in Section 2. Considering games from only time t 0 onwards, the posterior density of p t 0 , . . ., p t−1 , m, δ and r is therefore proportional to the likelihood times prior and is given by where , 2, 3 and 0 < p s < 1 for s = t 0 , . . ., t − 1.
For inference we use the empirical measure of the samples generated by the Gibbs sampler.From (1), the Gibbs sampler requires iterative variate generation from the following four full conditional densities: The generation of the p s variates is straightforward as almost every statistical package has a built-in beta generator.The generation of m, δ, r 1 , r 2 and r 3 from their respective full conditional densities is less obvious as their densities have non-standard forms.For these variates, we use a Metropolis within Gibbs step (Gilks, Richardson and Spiegelhalter 1996) where in each case we use the corresponding prior distribution as the proposal distribution.Note that this is a practical advantage in avoiding improper priors.
We now estimate the probability that the home team wins at time t.Denote the data at time t as data = (x t 0 , . . ., x t−1 , D t 0 , . . ., D t−1 ).
Then the predictive density at time t is given by where π(m, δ, r | data) is the marginal posterior density obtained from (1).It is not practical to integrate according to (2).Instead, the integration is carried out using the MCMC algorithm for Model 3. Retaining the values (m, δ, r) obtained in an iteration of the MCMC algorithm, generate p t ∼ beta(mλ t δ, m) and then generate X t from a Bernoulli(p t ) distribution.The X t are now a sample from the predictive density in (2) and can be averaged to estimate the predictive probability that the home team wins.Noting that E(X t |data) = E(P t |data), a simpler and more efficient estimate of the predictive probability that the home team wins can be obtained by averaging the λ t δ/(λ t δ + 1) values.This technique is sometimes referred to as Rao-Blackwellization.

Prediction
Besides making inferences on the unknown parameters, we are also interested in predicting the outcome of games for the remainder of the season.We remark that the MLB schedule is determined well in advance of the season, and therefore we always know who plays whom at a given point in time.
Recall that t is the current time in the season.The predictive density of X t , . . ., X T can be obtained via where the terms within the product are given in (2).
Our Markov chain algorithm can also be used for estimating results for the remainder of the season.After the chain has stabilized, we retain the values (m, δ, r) from a given iteration.We generate p t ∼ beta(mλ t δ, m) and then generate X t ∼ Bernoulli(p t ).Next we generate p t+1 ∼ beta(mλ t+1 δ, m) followed by X t+1 ∼ Bernoulli(p t+1 ).We continue the process obtaining a single sequence X t , . . ., X T .For each team, we combine the number of wins at the current time in the season with the number of wins during the simulated season t, . . ., T to obtain the number of wins over the full season.The entire process is repeated over multiple simulated seasons to derive estimates for the total number of wins.A practical difficulty concerns the updating of D s = (α s , β s , γ s ) used in the construction of λ s during the simulated part of the season s = t, . . ., T .
The win ratios α s change for a team as they win and lose games in the simulated part of the season.Although α s is readily changed according to X t 0 , . . ., X s−1 , the ratios for team batting β s and starting pitchers' ERAs γ s are unavailable for games well off into the future.As a compromise, we fix β s = β t and set γ s equal to the team average at time t, s = t, . . ., T .

Analysis of 2001 MLB Data
Collecting relevant information on the entire 2001 MLB regular season is a formidable task.
As such, we considered only games played on each of twelve dates, beginning April 15 and spread nearly evenly across the season.In these games, we observe 106 home team wins out of a total of 179 games giving a home field winning proportion of 0.59.This sample proportion is a little larger than historical values for MLB (Berry 2001).Table 2 lists the sample means and standard errors for α, β, γ and the home field winning variable X.We observe that the sample means of α, β and γ are what we expect, e.g.nearly centered about 1.Note that β does not vary greatly about 1.0; this is an early suggestion that the batting variable may not have a significant role in prediction.
We first consider practical convergence of the Markov chain using the CODA software (Best, Cowles and Vines 1995).CODA is a set of S-Plus functions that gives graphical summaries and diagnostics of convergence from MCMC output.The results from CODA suggest that practical convergence is achieved after 1,000 iterations.The following results are based on 1000 iterations and 100 replications after 1000 burn-in iterations.
Table 3 provides posterior means and posterior standard deviations of the primary parameters r, δ and m.The posterior distributions of the r variables are somewhat flat confirming the difficulty in determining the exact contribution of the winning percentage, batting and pitching to the prediction problem.It also points to the need for larger data sets in future analyses.To interpret the magnitude of the home field advantage, consider two teams with equal relative strength λ s = 1.In this case, the posterior mean of δ given by 1.58 implies that the home team wins with expected probability 0.61, which is close to the sample mean of X given by 0.59.The posterior mean m = 5.23 is less than the prior mean 10.0 and is in the direction of Model 2 where the value of m is 1.This provides some support for Model 2 although we rely on a model choice criterion based on prediction.
We now turn to the prediction problem where the remainder of the 2001 MLB season is simulated from different points in time.We note that there are very few MLB games that are rained out and not rescheduled.Therefore almost all of the 30 teams completed the full 162 games.We compare the predictive ability of the three models given in Section 2. The criterion used is the sum of squared differences between the expected predictive win probabilities and the final season winning percentages over all 30 MLB teams.The chosen current time points are May 30, June 30, July 30 and August 30 in the regular season.The results are shown in Table 4 where the last column indicates the squared difference criterion between the current winning percentages and the final season winning percentages.The last column therefore corresponds to simple extrapolation of a team's current performance to the end of the season.Naturally, we expect final season predictions to improve as the season progresses and this is the case in Table 4.We observe that the predictive ability of all three models is superior to simple extrapolation.Given the aforementioned difficulty of prediction in MLB, we stress that this is a promising result and is indicative of model adequacy.
Amongst the three models, it seems that Model 3 may be overfitting although all three models are comparable in their predictive performance.We prefer Model 2 as it is slightly better than Model 3 in prediction and allows more variation than Model 1 without requiring additional parameters.
The results from Table 3 also suggest that we might prefer Model 2 over Model 1.
In Table 5, we provide the prediction results for each of the 30 MLB teams from two points in time (May 30, July 30) based on Model 2. We note that 22 of the 30 predictions from July 30 are closer to the final winning percentages than the current winning percentages recorded on July 30.
We also note that the model tends to exhibit a regression towards the mean effect.For example, using results to three decimal places, all six division leaders on July 30 have a predicted winning percentage that is less than their actual winning proportion on July 30.
It is also interesting to compare the predictive impact of covariates α s , β s and γ s .The seven types of covariate combinations under Model 2 are given by S 5 : λ s = α r 1 s γ r 3 s , S 6 : λ s = β r 2 s γ r 3 s , S 7 : λ s = α r 1 s β r 2 s γ r 3 s .
The relative strengths of S 1 , S 2 and S 3 consist of only one effect.The relative strengths of S 4 , S 5 and S 6 consist of two effects and the relative strength of S 7 consists of all three effects.Table 6 provides the sum of squared differences between the expected predictive win probabilities and the final season winning percentages over all 30 MLB teams based on S 1 to S 7 .For comparison purposes, the criterion is also reported for the simple extrapolation model.The results are not definitive but are in accord with prevailing wisdom that suggests that starting pitching is a key determinant in prediction.

Concluding Remarks
We have seen that the proposed two-stage Bayesian model is effective in predicting division winners in MLB given results partway through the season.The model is simple, easily incorporates

Model 3 :
Prob(X s ) = p Xs s (1 − p s ) 1−Xs and p s ∼ beta(mλ s δ, m) Model 3 generalizes Model 2 by introducing the parameter m > 0. When m is equal to 1, then Model 3 reduces to Model 2. When m → ∞, then Model 3 reduces to Model 1. Thus Model 1 and Model 2 can be viewed as sub-models, and we can assess their suitability according to the posterior distribution of m in Model 3. Note that Model 3 implies E(p s | m) = λ s δ/(1 + λ s δ) as in Model 1and in Model 2, and Var(p s | m) = λ s δ/(λ s δ + 1) 2 (mλ s δ + m + 1).Therefore we can maintain the same mean structure, yet the probability distribution can be made more or less diffuse according to the value of m.We choose an exponential prior distribution with mean m 0 for m.In this paper, we let m 0 = 10.

Table 2 :
Sample means and standard errors of α, β, γ and X.

Table 3 :
Priors, posterior means and posterior standard deviations of the primary parameters.

Table 4 :
Squared error difference criterion for comparing the three models of Section 2 and the simple extrapolation model.

Table 5 :
Predicted probabilities from different points in time using Model 2.

Table 6 :
Squared error difference criterion for assessing covariate combinations as described in (4).The simple extrapolation model is also included.