Predicting Match Outcomes in the English Premier League: Which Will Be the Final Rank?

In any sports competition, there is strong interest in knowing which team will be the champion at the end of the championship. The result of a match, the chance of a team qualifying for a specific tournament, the chance of being relegated, the best attack and the best defense, among others, are also of interest. In this paper we present a simple method with good predictive quality, easy implementation and low computational effort, which allows the calculation of all the quantities of interest above. Following Lee (1997), we estimate the average number of goals scored by each team by assuming that the number of goals scored by a team in a match follows a univariate Poisson distribution, but we consider linear models that express the sum and the difference of goals scored in terms of five covariates: the goal average in a match, the home-team advantage, the team's offensive power, the opponent team's defensive power and a crisis indicator. The methodology is applied to the 2008-2009 English Premier League.


Introduction
Football is one of the most popular sports in the world. Played in many countries, it is a collective sport contested by two teams whose purpose is to put the ball into the opposing team's goal (score a goal). The team that scores more goals wins the match. A draw occurs when both teams score the same number of goals. In many countries there are numerous football clubs competing for regional and national championships in several leagues. There are also intracontinental and intercontinental championships, played by the teams that qualify for them.
This paper focuses on the English Premier League, which is one of the biggest and most valued national championships of football clubs. We concentrate on the 2008-09 Premier League season, the seventeenth since its establishment. During a season, twenty participating teams play in a single group; each team plays every other team twice, once at its home stadium and once at its opponent's, for a total of 380 matches. The team that has the most points at the end is declared the champion. If two or more teams finish with the same number of points, they are ranked according to the following criteria, in order: goal difference, followed by goals scored. If a tie still persists, whether for the championship, for qualification to other competitions or for relegation, a play-off match at a neutral venue decides the rank. The three best ranked teams qualify for the next (in our case the 2009-10) UEFA Champions League group stage; the fourth, for the 2009-10 UEFA Champions League play-off round; the fifth and the sixth, for the 2009-10 UEFA Europa League play-off round; and the seventh, for the 2009-10 UEFA Europa League third qualifying round. The three worst ranked teams are relegated to the 2009-10 Football League Championship.
In football, as in any sports competition, there is strong interest in knowing which team (in a collective sport) or which player (in an individual sport) will be the champion at the end of the championship. The result of a match, the chance of qualifying for a specific tournament, the chance of being relegated, the best attack and the best defense, among others, are also of interest.
Several papers in the literature consider football score prediction applied to championship leagues such as the English Premier League (Lee 1997, Everson and Goldsmith-Pinkham 2008, Karlis and Ntzoufras 2009), the Norwegian Elite Division (Brillinger 2006) and the Brazilian Championship (Brillinger 2008). Lee (1997) used a Poisson regression to model the number of goals scored by a football team, where the mean reflects the strength of the team, the quality of the opposition and the home advantage (if it is the home team). Independence between the goals scored by the two teams was assumed, and his methodology was applied to the 1995-1996 English Premier League. More recently, Brillinger (2008) proposed modeling the win, draw and loss probabilities directly. In that paper, Brillinger employed a trinomial model and applied it to the Brazilian 2006 Series A championship to estimate the probability of any particular team being champion, to estimate each team's final points and to evaluate the chance of a team finishing in the top four places. Karlis and Ntzoufras (2009) applied the Skellam distribution to model the goal difference between home and away teams. The authors argue that this approach relies neither on independence nor on the marginal Poisson distribution assumption for the number of goals scored by the teams. A Bayesian analysis for predicting match outcomes in the English Premier League (2006-2007 season) is carried out using a log-linear link function and non-informative prior distributions for the model parameters.
In this paper, following Lee (1997), we estimate the average number of goals scored by each team by assuming that the number of goals scored by a team in a match follows a univariate Poisson distribution, but we consider linear models that express the sum and the difference of goals scored in terms of five covariates: the goal average in a match, the home-team advantage, the team's offensive power, the opponent team's defensive power and a crisis indicator. Generally, a football team may go through a crisis when relationships are poor among players, between players and the coach or between fans and the team, when the coach criticizes players and vice versa, and so on. This occurs most often when the team obtains successive negative results (successive losses and draws, or bad performances).
The objective of this paper is to present a simple method with good predictive quality, easy implementation and low computational effort that allows the calculation of the probabilities of interest: which team will be the champion, which ones will be relegated, which ones will qualify for another tournament, which team will be the best home team, which team will be the best away team (the team that scores the most points playing away from home), etc. The model is applied to the 2008-2009 English Premier League. The de Finetti measure (de Finetti 1972) is used to quantify the predictive quality of the model.
To perform the forecasts we use the Poisson model directly with the estimated means. We generate the scores of the matches still to be played in order to obtain single-match predictions, and we also simulate several whole tournaments to obtain the probability of being champion, of being relegated, of finishing among the three best ranked teams, etc.
The paper is outlined as follows. Section 2 presents the probabilistic model. Section 3 presents the results of fitting the proposed model to the 2008-09 English Premier League. In that section we also quantify the predictive quality of our model using the de Finetti measure (de Finetti 1972). Section 4 presents the results of a simulation study performed to estimate some probabilities of interest, such as those for a single match, the champion, qualification for the 2009-10 UEFA Champions League group stage and relegation. Section 5 concludes the paper with final considerations about the results and further work.

Probabilistic Model and Estimation
In this section we present the probabilistic model and the estimation procedure. To illustrate, we use as an example a fictitious tournament played by Arsenal, Chelsea, Liverpool and Manchester United, in which Liverpool is in crisis during the period of the tournament. In this tournament we assume that the following results occurred [...]. The final match of this tournament will be between Manchester United and Arsenal (the top two ranked teams) at Old Trafford Stadium. Based on the performance of the teams in the tournament, the idea is to predict the chances of each team winning, drawing or losing the final match.
Thus, for a given match in a competition, let X and Y be the numbers of goals scored by the home and away teams. Henceforth, we shall assume that X and Y are independent Poisson distributed random variables with means $\lambda_X$ and $\lambda_Y$, respectively.
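From the linearity of expectation, it follows that $E(X + Y) = \lambda_X + \lambda_Y$ and $E(X - Y) = \lambda_X - \lambda_Y$, so that

$$\lambda_X = \frac{E(X + Y) + E(X - Y)}{2} \qquad (1)$$

and

$$\lambda_Y = \frac{E(X + Y) - E(X - Y)}{2}. \qquad (2)$$

The sum and the difference of goals are described by the linear models

$$(X + Y)_i = S_i \alpha + \varepsilon_{ai} \qquad (3)$$

and

$$(X - Y)_i = T_i \beta + \varepsilon_{bi} \qquad (4)$$

for $i = 1, 2, 3, \ldots, n$, where $n$ is the number of matches in the dataset and $\varepsilon_{ai}$ and $\varepsilon_{bi}$ are independent errors with mean 0. In linear model (3), $(X + Y)_i$ is the total number of goals scored (by both teams) in the i-th match; the vector $\alpha$ is composed of $N + 2$ parameters ($N$ parameters, one for each team in the dataset; one parameter associated with the home advantage; and one parameter associated with whether one of the teams, or both, is in crisis).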
The row vector $S_i$ has $N + 2$ elements: $N$ elements associated with the status of each team in the match in question, one component that indicates the home advantage and one component that indicates whether one of the teams (or both) is in crisis.
The status of a team is an indicator variable that takes the value 1 if the team participates in the i-th match and 0 otherwise.
A common value is assigned to both teams involved in the match because the value of $(X + Y)_i$ does not depend on identifying which team is X or Y. For example, the results 2 × 1, 3 × 0, 1 × 2 and 0 × 3 all mean that three goals occurred ($(X + Y)_i = 3$).
The home advantage component is also an indicator variable, taking the value 1 if the match was played at the stadium of one of the teams and 0 if it was played at a neutral stadium. The crisis component is likewise an indicator variable, taking the value 1 if one of the teams (or both) is in crisis and 0 otherwise.
For example, in the proposed application, the first match of the tournament has $(X + Y)_1 = 0 + 1 = 1$. With the vector $\alpha$ given by $(\alpha_{Arse}, \alpha_{Chel}, \alpha_{Liver}, \alpha_{Manc}, \alpha_{Home}, \alpha_{Crisis})^t$, the row vector $S_1$ becomes $[\,1\ 0\ 1\ 0\ 1\ 1\,]$.
Considering jointly all tournament matches, $X + Y$ becomes the vector of goal totals and $S$ the $n \times (N + 2)$ matrix of status, venue and crisis indicators. For the tournament matches, model (3) can be written in matrix form as $X + Y = S\alpha + \varepsilon_a$.
In linear model (4), $(X - Y)_i$ is the difference of goals scored by teams X and Y in the i-th match; the vector $\beta$ is composed of $N + 2$ parameters ($N$ parameters, one for each team in the dataset; one parameter associated with the home advantage; and one parameter associated with whether one of the teams, or both, is in crisis). From a practical point of view, in football, some teams play better at home and others away. In addition, a team that is going through a crisis is under pressure to achieve a positive result, which may affect its performance in the match. Note that a team may be in crisis in the i-th match but, after some positive results or a major victory against a rival, move out of this crisis status. The same applies in the opposite direction. For these reasons we added these covariates to the model, relating them to the number of goals scored by teams X and Y.
The row vector $T_i$ has $N + 2$ elements: $N$ elements associated with the status of each team in the match in question, one component that indicates the home advantage and one component that indicates whether one of the teams (or both) is in crisis.
In this model, if a team participates in the i-th match its status is an indicator variable that takes the value 1 (for team X) or −1 (for team Y). If the team does not participate, its status is 0. It is necessary to distinguish between teams X and Y because the value of $(X - Y)_i$ depends directly on identifying which team is X and which is Y. For example, the results 3 × 2 and 2 × 3 have completely different meanings because $(X - Y)_i = 1$ and $(X - Y)_i = -1$, respectively. The home advantage component is also an indicator variable, taking the value 1 if the match was played at the stadium of one of the teams and 0 if it was played at a neutral stadium. The crisis component is likewise an indicator variable, taking the value 1 if one of the teams (or both) is in crisis and 0 otherwise.
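The construction of the rows $S_i$ and $T_i$ can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `sum_row` and `diff_row` and the `TEAMS` list are hypothetical, and the column order (Arsenal, Chelsea, Liverpool, Manchester United, home, crisis) is taken from the ordering of the parameter vectors in the example.

```python
# Column order assumed from the example: four teams, then home and crisis indicators.
TEAMS = ["Arsenal", "Chelsea", "Liverpool", "Manchester United"]

def sum_row(team_x, team_y, home_advantage, crisis, teams=TEAMS):
    """Row S_i of the sum model: both participating teams get the value 1."""
    row = [0] * (len(teams) + 2)
    row[teams.index(team_x)] = 1
    row[teams.index(team_y)] = 1
    row[-2] = 1 if home_advantage else 0  # 0 only for a neutral venue
    row[-1] = 1 if crisis else 0          # 1 if either team (or both) is in crisis
    return row

def diff_row(team_x, team_y, home_advantage, crisis, teams=TEAMS):
    """Row T_i of the difference model: +1 for team X, -1 for team Y."""
    row = [0] * (len(teams) + 2)
    row[teams.index(team_x)] = 1
    row[teams.index(team_y)] = -1
    row[-2] = 1 if home_advantage else 0
    row[-1] = 1 if crisis else 0
    return row

# First match of the example (Arsenal and Liverpool, Liverpool in crisis);
# which side hosted does not matter for the sum row.
S1 = sum_row("Arsenal", "Liverpool", home_advantage=True, crisis=True)

# Final match: Manchester United (team X, at home) against Arsenal (team Y).
S7 = sum_row("Manchester United", "Arsenal", home_advantage=True, crisis=False)
T7 = diff_row("Manchester United", "Arsenal", home_advantage=True, crisis=False)
```

With these definitions, `S1`, `S7` and `T7` reproduce the row vectors quoted in the text for the first match and for the final.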
For example, in the proposed application, the fourth match of the tournament has $(X - Y)_4 = 0 - 2 = -2$ and, with the vector $\beta$ given by $(\beta_{Arse}, \beta_{Chel}, \beta_{Liver}, \beta_{Manc}, \beta_{Home}, \beta_{Crisis})^t$, the row vector $T_4$ becomes [...]. Considering jointly all tournament matches, $X - Y$ becomes the vector of goal differences and $T$ the $n \times (N + 2)$ matrix of status, venue and crisis indicators. For the tournament matches, model (4) can be written in matrix form as $X - Y = T\beta + \varepsilon_b$.

Thus, in the example, the estimators of $\lambda_X$ and $\lambda_Y$ for the final match between Manchester United and Arsenal can be calculated from (1) and (2) by replacing the expectations with the fitted values of models (3) and (4):

$$\hat{\lambda}_X = \frac{S_7\hat{\alpha} + T_7\hat{\beta}}{2}, \qquad \hat{\lambda}_Y = \frac{S_7\hat{\alpha} - T_7\hat{\beta}}{2}.$$

For the final match between Manchester United and Arsenal held at Old Trafford Stadium, the vectors $S_7$ and $T_7$ are given by $S_7 = [\,1\ 0\ 0\ 1\ 1\ 0\,]$ and $T_7 = [\,-1\ 0\ 0\ 1\ 1\ 0\,]$, from which we obtain the point forecasts, where X and Y are, respectively, the numbers of goals scored by Manchester United and Arsenal.

Assuming equal weights, from the Moore-Penrose generalized inverse matrix (Venables and Ripley, 1999) we obtain the desired estimates $\hat{\alpha}$ and $\hat{\beta}$, given by

$$\hat{\alpha} = (S^t S)^{-1} S^t (X + Y) = (1.750,\ 2.500,\ 0.375,\ 2.000,\ -1.500,\ 0.375)^t$$

and, analogously, $\hat{\beta} = (T^t T)^{-1} T^t (X - Y)$, where the inverse is understood in the generalized sense. Hence $S_7\hat{\alpha} = 1.750 + 2.000 - 1.500 = 2.25$, and we obtain the estimates

$$\hat{\lambda}_{Manc} = 1.68 \qquad (5)$$

and

$$\hat{\lambda}_{Arse} = 0.57. \qquad (6)$$

Then, the "least squares score" for the final match would be (1.68 × 0.57).

Note that we can encounter a problem when the estimated sum is smaller than the estimated difference, i.e., when $S_i\hat{\alpha} < T_i\hat{\beta}$. When this occurs, we obtain $\hat{\lambda}_Y < 0$ in expression (2). To overcome this problem, we move ("walk") the estimate to the set of valid values. If we set $\hat{\lambda}_Y$ to the nearest valid value, i.e., $\hat{\lambda}_Y = 0$, we would be assuming that team Y has no chance of winning the match, which is not realistic in football: even a team of much lower technical quality than its opponent can occasionally win (an unexpected result). Thus, in the few situations where this problem occurred, we set the rate of team Y to a small value, i.e., $\hat{\lambda}_Y = 0.25$.
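The step from the fitted sum and difference to the two goal rates, including the small-value correction, can be sketched as below. This is an illustration rather than the authors' implementation. The names `dot` and `goal_rates` are hypothetical; the estimated difference of 1.11 is an assumption implied by the reported score (1.68 − 0.57), since the fitted $\hat{\beta}$ is not available here; and applying the same floor to team X is an additional assumption made by symmetry.

```python
ALPHA_HAT = [1.750, 2.500, 0.375, 2.000, -1.500, 0.375]  # estimate reported in the text
S7 = [1, 0, 0, 1, 1, 0]                                   # sum-model row for the final

def dot(row, coef):
    """Inner product of a design row with a parameter vector."""
    return sum(r * c for r, c in zip(row, coef))

def goal_rates(sum_hat, diff_hat, floor=0.25):
    """Convert the fitted sum and difference of goals into the two Poisson rates.

    A non-positive rate is replaced by `floor` (0.25 in the text). The text only
    discusses team Y; treating team X the same way is an assumption.
    """
    lam_x = (sum_hat + diff_hat) / 2.0
    lam_y = (sum_hat - diff_hat) / 2.0
    if lam_x <= 0:
        lam_x = floor
    if lam_y <= 0:
        lam_y = floor
    return lam_x, lam_y

sum_hat = dot(S7, ALPHA_HAT)   # 1.750 + 2.000 - 1.500 = 2.25
diff_hat = 1.11                # assumed value, implied by the reported 1.68 and 0.57
lam_manc, lam_arse = goal_rates(sum_hat, diff_hat)
```

Calling `goal_rates(1.0, 2.0)` illustrates the correction: the raw rate for team Y would be −0.5, so it is replaced by 0.25.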

Deriving the probabilities
For a given match played by teams X and Y, we calculate the probabilities of win ($P_W$), draw ($P_D$) and loss ($P_L$) of team X from the predictive distributions, using the following equations:

$$P_W = P(X > Y) = \sum_{x=1}^{\infty} \sum_{y=0}^{x-1} P(X = x)\,P(Y = y), \qquad (7)$$

$$P_D = P(X = Y) = \sum_{x=0}^{\infty} P(X = x)\,P(Y = x) \qquad (8)$$

and

$$P_L = P(X < Y) = \sum_{y=1}^{\infty} \sum_{x=0}^{y-1} P(X = x)\,P(Y = y). \qquad (9)$$

From the values of $\hat{\lambda}_{Manc}$ and $\hat{\lambda}_{Arse}$ obtained in (5) and (6) we can calculate the probabilities of a Manchester United win, a draw and an Arsenal win through expressions (7), (8) and (9), respectively. Thus, we obtain the following probabilities [...].

To assemble the final league tables, for each of the M predicted matches we check whether there was a victory of team X ($X > Y$), a draw ($X = Y$) or a victory of team Y ($X < Y$). We give 3 points to the winning team, or 1 point to each team in case of a draw. Starting from the current league table, we update it with the simulated results for each of the n simulated championships. From the final league tables we can calculate, for example, the chance of a particular team being champion or being relegated as follows:

P[Team to be champion] = #(team finished in first place in the final league table)/n,

P[Team to be relegated] = #(team finished in one of the last three places in the final league table)/n,

where # refers to the number of times the event was observed in the simulation.
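The win, draw and loss probabilities for two independent Poisson counts can be approximated numerically by truncating the infinite sums. The sketch below is an illustration under assumptions: the function names and the truncation point `max_goals=20` are hypothetical choices, and the final renormalization only compensates for the truncated tail.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probabilities(lam_x, lam_y, max_goals=20):
    """Return (P_W, P_D, P_L) for team X, assuming independent Poisson goals.

    The double sum is truncated at max_goals for each team (an assumption);
    the neglected tail is negligible for typical football goal rates.
    """
    px = [poisson_pmf(k, lam_x) for k in range(max_goals + 1)]
    py = [poisson_pmf(k, lam_y) for k in range(max_goals + 1)]
    p_win = p_draw = p_loss = 0.0
    for x, p_x in enumerate(px):
        for y, p_y in enumerate(py):
            joint = p_x * p_y
            if x > y:
                p_win += joint
            elif x == y:
                p_draw += joint
            else:
                p_loss += joint
    total = p_win + p_draw + p_loss  # slightly below 1 because of truncation
    return p_win / total, p_draw / total, p_loss / total

# Rates reported in the text for the final: Manchester United 1.68, Arsenal 0.57.
p_w, p_d, p_l = outcome_probabilities(1.68, 0.57)
```

Because the home rate is much larger than the away rate in this example, the computed `p_w` exceeds `p_l`, in line with Manchester United being the favorite in the fictitious final.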

Data Analysis
In this section we present the results obtained by applying our methodology to the English Premier League, particularly to the 2008-09 Premier League season. We focus on the single-match predictions as well as on the predictions for the whole tournament.

General Data Structure and Assumption
We used as our data set the outcomes of the first 180 matches (18 rounds played) of the 2008-09 Premier League season to predict the following 200 matches (from the 19th to the 38th round). The first 18 rounds were chosen as the training set since only after 18 rounds had we observed one match of each team against the other opponents. The team crisis indicator, stated for each team in each round, was based on media information about the team.
In order to apply our methodology, we assume that the number of goals scored by each team in a match follows a univariate Poisson distribution and that the goals of the two teams are independent. We check these assumptions through a naive chi-square ($\chi^2$) test.
For the number of goals scored by the home team we observed $\chi^2_{obs} = 2.1313$ on 5 degrees of freedom, with critical value $\chi^2_c = 15.0863$ at the 1% significance level. For the number of goals scored by the away team we observed $\chi^2_{obs} = 10.3086$ on 4 degrees of freedom, with critical value $\chi^2_c = 13.2767$ at the 1% significance level. Since in both cases the observed values are smaller than the critical ones, there is insufficient evidence to reject the hypothesis that the goals follow a Poisson distribution. The test of independence was performed through a cross-tabulation of the home and away numbers of goals scored in all 380 matches of the 2008-2009 English Premier League, with an observed value of $\chi^2_{obs} = 16.9325$ on 16 degrees of freedom, which is smaller than the critical value $\chi^2_c = 31.9999$ at the 1% significance level. Hence there is no evidence against the assumption of independence.
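A goodness-of-fit statistic of this kind can be computed as sketched below. The code is an illustration, not the computation used in the paper: the function name `poisson_gof_statistic` is hypothetical, the goal counts in `home_counts` are invented for the example (the real frequency table is not reproduced here), and pooling the upper tail into the last cell is an assumption about how the cells were formed.

```python
import math

def poisson_gof_statistic(counts):
    """Pearson chi-square statistic for a Poisson fit.

    counts[k] is the number of matches with k goals; the last cell is read as
    'k or more' when computing its probability. Returns (statistic, degrees of freedom).
    """
    n = sum(counts)
    # Sample mean, treating the last cell as exactly k goals (a simplification).
    mean = sum(k * c for k, c in enumerate(counts)) / n
    probs = [math.exp(-mean) * mean ** k / math.factorial(k)
             for k in range(len(counts) - 1)]
    probs.append(1.0 - sum(probs))  # pooled upper tail
    statistic = sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))
    dof = len(counts) - 1 - 1       # cells minus one, minus one estimated parameter
    return statistic, dof

# Hypothetical frequencies of 0, 1, ..., 5 and 6+ home goals in 180 matches.
home_counts = [40, 60, 45, 22, 9, 3, 1]
statistic, dof = poisson_gof_statistic(home_counts)

CRITICAL_1PCT_5DF = 15.0863  # critical value quoted in the text for 5 degrees of freedom
reject_poisson = statistic > CRITICAL_1PCT_5DF
```

Seven cells with one estimated parameter give 5 degrees of freedom, which matches the degrees of freedom quoted for the home-team test; for these invented counts the Poisson hypothesis is not rejected.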

Quality of the Predictions
As pointed out in Section 1, the de Finetti measure (de Finetti 1972) was used to quantify the predictive quality of our model. Consider the set of all possible forecasts, given by the simplex $S = \{(P_W, P_D, P_L) \in [0, 1]^3 : P_W + P_D + P_L = 1\}$, where $P_W$ denotes the win probability, $P_D$ the draw probability and $P_L$ the loss probability. The vertices (1, 0, 0), (0, 1, 0) and (0, 0, 1) of $S$ represent the outcomes win, draw and loss, respectively. Thus, following de Finetti (1972), we calculate the de Finetti distance, which is the squared Euclidean distance between the point corresponding to the outcome and the one corresponding to the prediction. For instance, if a prediction is given by (0.45, 0.20, 0.35) and the outcome is a victory (1, 0, 0), then the de Finetti distance is $(0.45 - 1)^2 + (0.20 - 0)^2 + (0.35 - 0)^2 = 0.465$. The average of the de Finetti distances of a set of predictions is known as the de Finetti measure.
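The distance and the measure can be computed directly from their definitions. The following is a minimal sketch; the names `VERTICES`, `de_finetti_distance` and `de_finetti_measure` are hypothetical.

```python
# Vertices of the simplex for the three possible outcomes of team X.
VERTICES = {"win": (1, 0, 0), "draw": (0, 1, 0), "loss": (0, 0, 1)}

def de_finetti_distance(forecast, outcome):
    """Squared Euclidean distance between a forecast (P_W, P_D, P_L) and the outcome vertex."""
    return sum((p - v) ** 2 for p, v in zip(forecast, VERTICES[outcome]))

def de_finetti_measure(forecasts, outcomes):
    """Average de Finetti distance over a set of forecasts and observed outcomes."""
    distances = [de_finetti_distance(f, o) for f, o in zip(forecasts, outcomes)]
    return sum(distances) / len(distances)

# Worked example from the text: forecast (0.45, 0.20, 0.35), observed victory.
example = de_finetti_distance((0.45, 0.20, 0.35), "win")  # 0.3025 + 0.04 + 0.1225 = 0.465

# Equiprobable forecast: (2/3)^2 + (1/3)^2 + (1/3)^2 = 2/3 whatever the outcome.
benchmark = de_finetti_distance((1 / 3, 1 / 3, 1 / 3), "draw")
```

The equiprobable forecast always scores 2/3, which is why this value serves as a natural reference when judging a set of predictions.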
Before each of the 20 remaining rounds (19th to 38th) we calculated the win, draw and loss probabilities (see Subsection 2.1) for all matches, and the de Finetti measure (de Finetti 1972) associated with these predictions.
For each round, the associated de Finetti measure was 0.479, 0.510, 0.667, 0.508, 0.457, 0.700, 0.502, 0.650, 0.629, 0.498, 0.680, 0.645, 0.432, 0.545, 0.568, 0.499, 0.312, 0.575, 0.529 and 0.332. In almost all rounds the de Finetti measure was less than 2/3, the value attained by an equiprobable predictor (which assigns equal probability to all outcomes, $P_W = P_D = P_L = 1/3$). Moreover, the model includes the home and crisis status covariates, relating them to the number of goals scored by teams X and Y. To check whether these covariates affect the model, we fitted it again after removing one or both of them, and calculated the probabilities of victory, draw and loss and the corresponding de Finetti measure. The best predictions, which are the main interest here, were obtained with the full model including the two covariates.

Single Match Prediction
In this section, we present the forecasts for all the matches of the 35th round, which are shown in Table 1. To calculate the percentage of correct forecasts, a forecast ($P_W$, $P_D$, $P_L$) is considered correct if the outcome with the greatest probability coincides with the observed outcome. Under this criterion our model correctly forecast 9 of the 10 results, with a de Finetti measure equal to 0.3119.
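The correctness criterion amounts to comparing the most probable outcome with the observed one. A minimal sketch follows, with hypothetical function names and invented forecasts (the values of Table 1 are not reproduced here); ties between probabilities are resolved in favor of the first outcome in the order win, draw, loss, which is an assumption.

```python
OUTCOMES = ("win", "draw", "loss")

def predicted_outcome(forecast):
    """Outcome with the greatest probability in the forecast (P_W, P_D, P_L)."""
    best = max(range(3), key=lambda i: forecast[i])  # first index wins on ties
    return OUTCOMES[best]

def hit_rate(forecasts, outcomes):
    """Fraction of matches whose most probable outcome was the observed one."""
    hits = sum(predicted_outcome(f) == o for f, o in zip(forecasts, outcomes))
    return hits / len(outcomes)

# Invented forecasts and outcomes, for illustration only.
forecasts = [(0.45, 0.20, 0.35), (0.20, 0.30, 0.50)]
observed = ["win", "draw"]
rate = hit_rate(forecasts, observed)  # first forecast correct, second not
```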

Predictions for the whole Tournament
In this section, we present the final prediction of the classification on the teams) and to end up in the last place.

Overall Results
In this subsection, based on our 1,000 tournament replications, several types of interesting information can be obtained, such as how many times a team was the champion, how many times a team finished in the first three positions, and the variability of the number of points, goals scored, goals conceded, victories, losses, draws, etc.
All the results are presented in terms of averages. Initially, based on the observed data before the 20th round, Figure 1 displays the box-plots of the 1,000 predicted numbers of points for each team at the end of the tournament. We observe that, based on the available data, the fitted model indicates Manchester United as the favorite to win the tournament, followed by Chelsea, Everton and Liverpool. In fact, Manchester United won the tournament and Liverpool finished in second place, indicating an improvement in the performance of the Liverpool team. Also, Tables 6 and 7 in the Appendix display the probabilities of each of the 20 final positions for each team, considering the observed data before the 25th and 30th rounds, respectively.

Some Specific Results
To qualify for the 2009-10 UEFA Champions League group stage, a team needs to finish among the three best ranked teams. The probabilities of the six teams that had some probability of reaching the top three places are presented in Table 4. In Table 4, the teams that qualified (Manchester United, Liverpool and Chelsea) had higher probabilities of finishing among the three best ranked teams.
Another probability of interest is that of the teams that will be relegated. In a round-robin tournament there is intense competition, not only to be champion or to qualify for another tournament, but also to avoid relegation. The teams are relegated when they finish among the three worst ranked teams. The probabilities [...] victories (in the 2008-09 Premier League case, 173 victories for the home teams, 110 victories for the away teams and 97 draws). However, before each round we do not know the match outcomes, so assigning an equal chance to the three possible outcomes (win, draw, loss) seems a reasonable strategy that is independent of subjectivity. Other possibilities should nevertheless be considered in future work, for instance the case in which the chances of win, draw and loss are known in advance; note, however, that in this case subjectivity is taken into account.

P
for the simulation
Suppose the tournament is composed of N rounds. For each round r, r = N/2, ..., N, we obtained the final team classification, i.e., number of points, victories, draws, defeats, goals scored, goals conceded and goal difference. The forecast for the final classification was performed using a simulation based on the Poisson model, involving the following steps:
a) Fix n, the number of championships to be simulated, and r, the number of rounds played. Set c = 1 (the counter);
b) If c ≤ n, use the (r − 1) × 10 observed matches to estimate the home and away teams' goal rates;
c) For each of the M = [N − (r − 1)] × 10 matches to be played, simulate the number of goals scored using the Poisson distribution with the estimated rates obtained in step (b). Set c = c + 1 and return to step (b).
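The simulation steps above can be sketched as follows. This is an illustration under several assumptions, not the authors' code: the function names are hypothetical; the goal rates are passed in with each fixture instead of being re-estimated inside the loop; the Poisson sampler uses Knuth's multiplication method because the Python standard library has no Poisson generator; and ties on points are broken only by the goal difference of the simulated matches, with any remaining tie going to the team listed first.

```python
import math
import random
from collections import Counter

def sample_poisson(lam, rng):
    """Draw one Poisson variate by Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_champion_probabilities(points, fixtures, n=1000, seed=0):
    """Estimate P[team is champion] over n simulated championships.

    points:   dict mapping team -> points in the current league table
    fixtures: list of (home, away, lam_home, lam_away) for the matches still to play
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n):
        table = dict(points)
        goal_diff = dict.fromkeys(points, 0)  # simulated matches only (simplification)
        for home, away, lam_home, lam_away in fixtures:
            x = sample_poisson(lam_home, rng)
            y = sample_poisson(lam_away, rng)
            goal_diff[home] += x - y
            goal_diff[away] += y - x
            if x > y:
                table[home] += 3
            elif x < y:
                table[away] += 3
            else:
                table[home] += 1
                table[away] += 1
        champion = max(table, key=lambda t: (table[t], goal_diff[t]))
        titles[champion] += 1
    return {team: titles[team] / n for team in points}

# Hypothetical two-team illustration: A leads by 10 points with one match left.
probs = simulate_champion_probabilities(
    {"A": 50, "B": 40}, [("A", "B", 1.5, 1.0)], n=200, seed=1
)
```

The same counting scheme extends to relegation or qualification probabilities by sorting the final table and counting how often a team lands in the relevant positions.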
Figure 1: Box-plots of the predicted number of points obtained by each team, based on the data observed before the 20th round.

Table 1: Forecasts for single matches of the 35th round.

Table 5: Simulation results for the five teams with the highest probabilities of finishing in the last three places.

Table 6: Mean and standard deviation of final league ranks, and probability of each ranking (in %), 25th round.

Table 7: Mean and standard deviation of final league ranks, and probability of each ranking (in %), 30th round.