INVESTIGATING THE UNDERLYING CAUSAL NETWORK ON EUROPEAN FOOTBALL TEAMS

Football, or soccer, is considered one of the most important collective sports in the world. Managers, specialists and fans are always trying to identify the key ingredients of a good team. The evaluation of team quality may involve many variables and subjective concepts, and for this reason it is not simple to answer the following question: How to define quality? Another point that should be considered is the relative importance of the offensive and defensive aspects: Which one is more important to measure the quality of a football team? For this task, we propose the use of a causal model with latent variables as a tool to measure the subjective notion of team quality and how it can be affected by other aspects. Information from the four most important football leagues in the world (England, Germany, Italy and Spain) during three seasons (2011-2012, 2012-2013 and 2013-2014) was collected. We defined the latent variables in the model and evaluated the relationships among them. The results show that the offensive aspect exerts more influence on team quality than the defensive aspect, which is directly reflected in player market strategies.


Introduction
Football, called soccer in the USA, is a collective sport played by two teams of eleven players each. Football is considered one of the most popular and important sports in the world, being played in every nation without exception (Reilly and Williams, 2003). One example of its popularity is the number of countries affiliated to the Fédération Internationale de Football Association (FIFA), which is higher than the number affiliated to the United Nations (UN) or to the International Olympic Committee (IOC) (Louzada et al., 2014).
The simplest way to explain the objectives of the game is to focus on the results. Two teams, each typically composed of a goalkeeper, defenders, midfielders and strikers, play against each other on a field. For 90 minutes, split into two periods, they try to score goals and to avoid conceding them. A goal is scored when the ball crosses the goal line between the goalposts (Reilly and Williams, 2003; Louzada et al., 2014). Unlike many other sports, which have only two possible results (win and loss), football allows a third one, the draw. The team that scores the most goals is the winner, and when both teams score the same number of goals the match is a draw. Points are assigned to each team at the end of the match: three points to the winning team, zero points to the losing team and, in the case of a draw, one point to each team (Reilly and Williams, 2003; Louzada et al., 2014).
Football also has an economic dimension that can be seen in many different aspects. Every year the amount of money spent on player transfers tends to increase, mainly in the most important European leagues, such as the Barclays Premier League (England), Bundesliga (Germany), Serie A (Italy) and BBVA league (Spain), among others. Some cities use football games as a tourist attraction, especially every four years when the World Cup takes place. This is the biggest football event, currently contested by 32 nations that are initially divided into eight groups of four teams each. It is held in a different country each time and mobilizes not only the participating countries but many others around the world (Lee and Taylor, 2005).
Although the FIFA World Cup is the most famous and important football event, it is subject to a large "lucky" effect, since each team does not play against all other teams: an arbitrary team could benefit or be harmed depending on which teams are drawn into its group in the first phase of the championship. This makes the search for quality more complex, and in some situations it is almost impossible to find a pattern and/or a consensus about quality. In order to avoid the "lucky" effect, we performed our study using the four most important football leagues in the world (English, German, Italian and Spanish) over three different seasons (2011-2012, 2012-2013 and 2013-2014).
Scientific studies focused on specific objectives related to football have been widely proposed in recent decades. For instance, in the medical sciences some studies are related to fitness, e.g., devising better strategies to improve players' strength and stamina and to avoid injuries, or assessing whether a football player is able to return to play without risk (for further details, see Delvaux et al., 2014; Meckel et al., 2014; Stubbe et al., 2015). Other studies have aimed at predicting the results of a specific match or championship, as in Goddard (2005), Suzuki et al. (2010) and Louzada et al. (2014), at analyzing external factors that may directly influence the match outcome, as proposed by Nevill et al. (1996), Taylor et al. (2008) and Staufenbiel et al. (2015), or at studying the football betting market, as in Dixon and Coles (1997), Dixon and Pope (2004) and Goddard and Asimakopoulos (2004). Recently, the study of game-related statistics has been receiving considerable attention from researchers, the football industry and specialists as a powerful mechanism to improve a team, since it can measure the team's quality and highlight its most important players (see, e.g., Poulter, 2009; Castellano et al., 2012; Moura et al., 2014).
However, the evaluation of team quality may involve many variables and subjective concepts, and for this reason it is not simple to answer the following question: How to define the quality of a football team? Moreover, another point that should be considered is the importance of the offensive and defensive aspects. Which one can be considered more important to measure the quality of a team? A suitable answer to these questions can be derived using causal models with latent variables, which allow us to measure those subjective concepts of team quality and how they could be affected.
The amount of research using causal models has increased during the past decades, and such models have become an important tool to verify causal relationships in systems that contain observed variables, especially in the human sciences, where researchers usually try to study causal effects concerning subjective aspects such as intelligence, aspirations or political interventions (further details can be seen in Haavelmo, 1943; Duncan et al., 1968; Bollen, 1995; Lee and Zhu, 2000; Bollen, 2002; Ferron and Hess, 2007; Greene, 2011).
For the use of causal models two aspects ought to be considered: graph analysis (GA) and the structural equation model (SEM). GA involves searching for causal structures that qualitatively represent how variables are causally connected, while the SEM, given a known causal structure, allows us to infer the magnitude of the causal relationships. Also, SEMs can be regarded as multiple-trait regression models in which some response variables may appear as covariates on the right-hand side of the equations for other response variables. In the literature, causal models have been widely studied under two approaches. The first approach uses latent variables and then specifies the causal structure among them. This approach is of interest in situations in which subjective aspects or unmeasured variables are involved and causality is inferred from their relationships. The second approach uses a structure without latent variables and should be considered when the variables are measured and the relationships between them are used to infer causality (Duncan et al., 1968; Lee and Zhu, 2000; Bollen, 2002; Lee and Tang, 2006; Rosa et al., 2011).
GA has some particular notation that should be mentioned and is necessary for a better comprehension of this model: i) variables inside a circle are called latent variables; ii) variables inside a rectangle are the observed variables; iii) single-headed arrows represent a causal effect; and iv) double-headed arrows represent correlation. Moreover, we classify the explanatory variables as exogenous and the response variables as endogenous, and this classification extends to latent variables.
The remainder of the paper is organized as follows. In Section 2, we give a brief description of the data set. In Section 3, we discuss statistical inference for causal models via the structural equation model, including the maximum likelihood ratio method and some model selection criteria. The results given in Section 4 reveal the usefulness of the selected causal model with latent variables for analyzing real data. Concluding remarks are given in Section 5.

Data set
We chose league championships in order to minimize "lucky" effects, such as a bad day, a bad draw or any external intervention, that can occur in cup competitions. The data used in this paper come from the four most important football leagues affiliated to the Union of European Football Associations (UEFA) (Barclays Premier League from England, Bundesliga from Germany, Serie A from Italy and BBVA league from Spain) and cover the past three seasons (2011-2012, 2012-2013 and 2013-2014).
All of these leagues have the same structure, in which all teams play against each other twice, i.e., one home and one away game. Despite these similarities, the Bundesliga is composed of 18 teams, each playing 34 games, whereas the other leagues under study have 20 teams, each playing 38 games. Another difference among these leagues is the way in which the teams that qualify for the UEFA competitions (Champions League and Europa League) are chosen, as well as the number of relegated teams. To avoid any problems with the different number of games in each championship, we expressed all information on a per-game basis.
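The per-game scaling described above can be sketched as follows. This is only an illustration: the season totals are hypothetical numbers (not taken from the data set), and the names `SeasonTotals`, `points_rate` and `per_game` are introduced here for the example.

```python
from dataclasses import dataclass


@dataclass
class SeasonTotals:
    """Season totals for one team (hypothetical values, for illustration only)."""
    games: int
    wins: int
    draws: int
    goals_for: int


def points_rate(team: SeasonTotals) -> float:
    """Points per game: 3 points for a win, 1 for a draw, 0 for a loss."""
    return (3 * team.wins + team.draws) / team.games


def per_game(total: float, team: SeasonTotals) -> float:
    """Scale any season total by the number of games the team played."""
    return total / team.games


# Same raw totals, but 34 games (Bundesliga) versus 38 games (other leagues).
team_34_games = SeasonTotals(games=34, wins=20, draws=8, goals_for=68)
team_38_games = SeasonTotals(games=38, wins=20, draws=8, goals_for=68)

rate_34 = points_rate(team_34_games)
rate_38 = points_rate(team_38_games)
goals_34 = per_game(team_34_games.goals_for, team_34_games)
goals_38 = per_game(team_38_games.goals_for, team_38_games)
```

Identical raw totals yield different per-game values for the two season lengths, which is why totals from leagues of different sizes are not directly comparable.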
The information evaluated in this study consists of 32 different variables: win (total, home and away), draw (total, home and away), lose (total, home and away), points rate (total, home and away), goals favor, goals against, goals difference, shots, shots on goal, clean sheet, offsides, fouls, yellow and red cards, fouled (received fouls), tackles, interception, possession, dribble, shot conceded, pass accuracy, position, classification to UEFA league and relegation. This data set is available for consultation at http://www.whoscored.com. Table 1 presents a descriptive summary of some variables of the data set, divided by league, for the three above-mentioned seasons. BBVA League, while on the away games is on the Bundesliga. If we consider all leagues, the performance in home games is almost 60% greater than the performance in away games. In general, the best attack in a season scores almost four times as many goals as the worst attack. For any team belonging to the English league, at least 2% of the games finished without the team conceding a goal, while for teams from the other leagues this minimum percentage varies from 7.895 to 10.526. The Bundesliga has the team that most often avoided conceding a goal (61.765% of its games). On average, around 25% of the shots made were on goal in all leagues. Over the season, all teams achieved a passing accuracy above 60%, received at least almost one yellow card per game and, on average, received a red card every 10 games.
We split the data set into five groups of variables that present similar characteristics on the field, e.g., the number of fouls and cards, since a player can receive a card according to the number of fouls in the game or their intensity, or shots and offsides, since several shots on goal come from through balls.

Causal Inference
In this section, we are interested in creating some variables which could represent different subjective aspects, such as offensive, defensive, quality, etc. After that, we are able to use the causal models under the latent variables framework to perform the inferential procedures. In order to achieve the best possible model, we propose different relationships between variables with different latent structures and also allow the covariance relationships between all observed and latent variables.

Structural equation model
Here, we use the structural equation model to estimate the effects above. The SEM consists of two distinct parts. The first part is a set of equations describing the causal relations between latent variables (further details in Bollen, 1995; Lee and Zhu, 2000; Bollen, 2002; Lee and Tang, 2006). The model can be expressed as η = Bη + Γξ + ζ, where η represents the vector of latent endogenous variables, B is the matrix of loading coefficients that gives the effects of η_j on η_i, with diagonal equal to zero, Γ represents the matrix of loading coefficients that gives the effects of ξ_j on η_i, ξ is the vector of latent exogenous variables, which follows a multivariate normal distribution with mean 0 and covariance matrix Φ, and ζ is the vector of errors for the latent variable η, which has a multivariate normal distribution with mean 0 and covariance matrix Ψ. Here, we assume that ξ and ζ are not correlated.
We can note that η has a multivariate normal distribution with mean equal to 0 and covariance matrix given by
$$\operatorname{Cov}(\eta) = (I - B)^{-1}\left(\Gamma\Phi\Gamma^{\top} + \Psi\right)\left[(I - B)^{-1}\right]^{\top}.$$
The second part of the SEM describes how the observed variables are related to the latent variables. The model for the observed variables can be written as Y = Λy η + ε and X = Λx ξ + δ, where X is the matrix of observed variables related to the latent exogenous variables, with dimension (m × kx), Y is the matrix of observed variables related to the latent endogenous variables, with dimension (n × ky), and ε, with dimension (n × ky), and δ, with dimension (m × kx), are the matrices of errors in the equations, with covariance matrices Θε, of dimension (p × p), and Θδ, of dimension (q × q), respectively.
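Returning to the structural part, the covariance matrix of η follows from solving the structural equation for η. A short derivation is sketched below, assuming that (I − B) is invertible (an assumption not stated explicitly in the text) and using the stated assumption that ξ and ζ are uncorrelated:

```latex
\begin{align*}
\eta &= B\eta + \Gamma\xi + \zeta \\
(I - B)\,\eta &= \Gamma\xi + \zeta \\
\eta &= (I - B)^{-1}\left(\Gamma\xi + \zeta\right) \\
\operatorname{E}(\eta) &= (I - B)^{-1}\left(\Gamma\operatorname{E}(\xi) + \operatorname{E}(\zeta)\right) = 0 \\
\operatorname{Cov}(\eta) &= (I - B)^{-1}\left(\Gamma\Phi\Gamma^{\top} + \Psi\right)\left[(I - B)^{-1}\right]^{\top}
\end{align*}
```

The cross terms between ξ and ζ vanish in the last step because the two vectors are uncorrelated, and η is normal because it is a linear transformation of normal vectors.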
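The observed variables Y and X jointly follow a multivariate normal distribution N(0, Σ). Writing A = (I − B)^{-1}, the covariance matrix Σ of the stacked vector (Y, X) is given by
$$\Sigma = \begin{pmatrix} \Lambda_y A\left(\Gamma\Phi\Gamma^{\top} + \Psi\right)A^{\top}\Lambda_y^{\top} + \Theta_{\varepsilon} & \Lambda_y A\,\Gamma\Phi\Lambda_x^{\top} \\ \Lambda_x\Phi\Gamma^{\top}A^{\top}\Lambda_y^{\top} & \Lambda_x\Phi\Lambda_x^{\top} + \Theta_{\delta} \end{pmatrix},$$
a (p + q) × (p + q) matrix, where p and q are the dimensions associated with Θε and Θδ, respectively. The maximum likelihood estimate (MLE) Θ̂ of the parameter vector Θ is obtained by solving the score equations for Θ.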

Model selection
In this section, we apply different measures to determine which of the models considered should be taken as the best one for describing the given data set.
In the SEM context, a model is considered suitable if the covariance structure implied by the model is similar to the covariance structure of the sample data, as indicated by an acceptable value of a goodness-of-fit index (GFI) (Cheung and Rensvold, 2002). In the literature, the most popular GFI used in SEM is the χ² statistic. However, a problem arises because the χ² statistic depends on the sample size. For instance, for large sample sizes the χ² test becomes highly sensitive, rejecting models for discrepancies that are statistically detectable but of no practical importance.
To overcome this problem, many authors have proposed GFIs as alternatives to the χ² statistic in recent decades. Some of them are the Comparative Fit Index (CFI) (Bentler, 1990), Tucker-Lewis Index (TLI) (Tucker and Lewis, 1973), Normed Fit Index (NFI) (Bentler and Bonett, 1980) and root mean squared error of approximation (RMSEA) (Steiger, 1989). In this paper, we used the measures suggested by Bollen (1995) and Kline (2011), i.e., the CFI, TLI and RMSEA.
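The sample size dependence can be made explicit. As a sketch, assuming maximum likelihood estimation and writing S for the sample covariance matrix, N for the sample size and k for the number of observed variables (notation introduced here, not taken from the text), the test statistic is a multiple of the minimized discrepancy:

```latex
\begin{align*}
F_{ML}(\theta) &= \log\lvert\Sigma(\theta)\rvert + \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right) - \log\lvert S\rvert - k, \\
\chi^{2} &= (N - 1)\,F_{ML}(\hat{\theta}).
\end{align*}
```

For a fixed nonzero discrepancy the statistic grows linearly with N, so even a negligible misfit is eventually rejected in a large enough sample. Some software uses N rather than N − 1 as the multiplier.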

Comparative Fit Index (CFI)
The CFI is an incremental fit index that measures the relative improvement in the fit of the proposed model over that of a baseline model, typically the independence model. Its formula can be expressed as
$$\mathrm{CFI} = 1 - \frac{\max\left(\hat{C}_m - df_m,\ 0\right)}{\max\left(\hat{C}_b - df_b,\ 0\right)},$$
where Ĉm and Ĉb are the sample minimum discrepancies for the proposed and baseline models, respectively, and dfm and dfb are the degrees of freedom for the proposed and baseline models.

Tucker-Lewis Index (TLI)
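The TLI is an incremental fit index that was developed to overcome a disadvantage of the Normed Fit Index, namely its dependence on sample size. The TLI is given by
$$\mathrm{TLI} = \frac{\hat{C}_b/df_b - \hat{C}_m/df_m}{\hat{C}_b/df_b - 1},$$
with Ĉm, Ĉb, dfm and dfb as defined for the CFI.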

Root Mean Squared Error of Approximation (RMSEA)
In recent years, the RMSEA has come to be regarded as one of the most informative fit indexes due to its sensitivity to the number of estimated parameters in the model (Diamantopoulos and Siguaw, 2000). In other words, the RMSEA favours parsimony in that it will choose the model with the smaller number of parameters (Hooper et al., 2008).
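The RMSEA is computed from the sample size, the non-centrality parameter and the degrees of freedom of the proposed model, and is given by
$$\mathrm{RMSEA} = \sqrt{\frac{\hat{F}_0}{df_m}}, \qquad \hat{F}_0 = \max\left(\frac{\hat{C}_m - df_m}{n},\ 0\right),$$
where n denotes the sample size and dfm is the degrees of freedom for the proposed model.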
For the first two measures (CFI and TLI), values close to one indicate better models, while for the RMSEA, values smaller than 0.05 indicate an acceptable model. All computations were performed using the lavaan, simsem, semPlot and semTools packages available in the statistical software R (R Core Team, 2015).
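To make the three measures concrete, the following minimal sketch computes them from the discrepancy (χ²) values and degrees of freedom of a proposed and a baseline model. It is not the lavaan implementation. The function name `fit_indices`, the particular form of each index (which varies slightly across software) and all input values are assumptions chosen for illustration only.

```python
import math


def fit_indices(chi2_model, df_model, chi2_baseline, df_baseline, n):
    """Return CFI, TLI and RMSEA for a proposed model against a baseline model.

    n is the sample size used to scale the non-centrality estimate.
    """
    # Estimated non-centrality of each model, truncated at zero.
    ncp_model = max(chi2_model - df_model, 0.0)
    ncp_baseline = max(chi2_baseline - df_baseline, 0.0)

    # CFI: relative reduction in non-centrality with respect to the baseline.
    cfi = (1.0 - ncp_model / ncp_baseline) if ncp_baseline > 0 else 1.0

    # TLI: compares the chi-square / df ratios of the two models.
    ratio_baseline = chi2_baseline / df_baseline
    ratio_model = chi2_model / df_model
    tli = (ratio_baseline - ratio_model) / (ratio_baseline - 1.0)

    # RMSEA: non-centrality per degree of freedom and per observation.
    rmsea = math.sqrt(ncp_model / (n * df_model))

    return {"cfi": cfi, "tli": tli, "rmsea": rmsea}


# Hypothetical inputs, not the values obtained for the football data.
example = fit_indices(chi2_model=150.0, df_model=100,
                      chi2_baseline=2600.0, df_baseline=120, n=200)
```

With these hypothetical inputs the CFI and TLI are both close to 0.98 and the RMSEA is about 0.05, which illustrates how a model can sit near the usual cut-offs on all three measures at once.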

Results and discussion
For the data set described in Section 2, we create five latent variables based on the observed variables in order to explain several subjective aspects that specialists usually bring forward during discussion regarding football. These aspects are defensive, offensive, discipline, creation and quality. Subsequently, we introduce the causal relationship among all latent variables and consider a structure for selecting the best model based on the three measures mentioned in Section 3.2. In order to estimate the selected model, we consider the maximum likelihood ratio method, discussed in Section 3.1. Table 2 lists the estimates of the parameters and the relationships between observed and latent variables.

Latent variable for the football data
In this section, we create five latent variables (offensive, defensive, creation, discipline and quality) based on the observed variables from the football data set. We also give some comments about the relations (positive or negative) between latent and observed variables.

Offensive
It is suggested that the latent variable offensive is composed of goals favor, shots, shots on goal, offsides and wins, and all these variables are positively related to offensive. Also, wins and goals favor are the variables that contribute most to the offensive aspect. On the other hand, offsides is the variable that contributes least (around 21% of the effect of goals). These relations make sense, since any victory requires at least one goal. Further, we can observe a strong relation between shots on goal and goals, as well as between offsides and goals, since a lot of creation in football comes from through balls.

Creation
Creation is positively related to the percentage of passes completed, possession and dribbles, while it is negatively related to interceptions. Passes and ball possession are the variables that best explain the creation variable. Interceptions enter negatively because a team can only make interceptions when the ball is in the possession of the other team.

Defensive
The latent variable defensive is defined by goals against, shots conceded and clean sheet. The first two variables are positively related to defensive, while clean sheet has a negative relation. In absolute values, we observed that clean sheet and shots conceded are equivalent. These relations make sense from the point of view of the game, since a team that finishes more games without conceding a goal is expected to concede fewer goals over the whole season.

Discipline
Discipline is positively related to fouls and to yellow and red cards. We can observe that the difference between the smallest and the largest value is around 22%. These relationships reflect what actually happens on a football field, since players can receive cards for several reasons, such as the number of fouls or the intensity of fouls.

Quality
Quality is positively related to points rate, goals difference, and home and away points, while it is negatively related to classification to European leagues and position. These relations can be explained by the fact that a good classification requires a higher number of points in home and away games. Goal difference has the same explanation, because more victories imply more goals scored than conceded.

Causal relationship between latent variables
After constructing the offensive, defensive, creation, discipline and quality variables, we proposed causal relationships among them and evaluated several scenarios in order to find the structure that represents the best model. The values of the CFI, TLI and RMSEA measures for the best model are 0.982, 0.974 and 0.069, respectively. All relationships between latent and observed exogenous variables are presented in Table 2, and the causal relationships among all latent variables are displayed in Figure 1. We can observe that the offensive and defensive characteristics are correlated with each other, as are the discipline and creation aspects, without any causal meaning, and for this reason these relations are expressed by two-headed curved arrows. On the other hand, creation and discipline have direct causal effects on the offensive and defensive aspects, respectively, which are represented by single-headed straight arrows. Also, the offensive and defensive characteristics directly affect the quality of the football team. Table 3 shows that the creation variable is a cause of the offensive variable, with an effect equal to 1.088. In the same way, the discipline variable has an effect on the defensive aspect equal to 2.554. Moreover, both variables (defensive and offensive) affect team quality, with coefficients equal to -0.194 and 0.920, respectively. According to Table 3, we can also infer that the discipline variable has an indirect effect (discipline effect × defensive effect) on quality equal to -0.4954, while the creation variable has an indirect effect (creation effect × offensive effect) on quality equal to 1.00096. Based on Table 3, interventions that improve offensive characteristics imply a gain in team quality almost five times larger than that of interventions on defensive aspects.
We also note that the indirect effect provided by creation is almost the same as the direct effect provided by offensive in relation to team quality.
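The indirect effects quoted above are products of the direct effects along a causal path. The following sketch reproduces that arithmetic from the coefficients reported in the text. The dictionary layout and the function name `indirect_effect` are choices made for this illustration and are not part of the original analysis.

```python
# Direct effects between latent variables as reported in the text (Table 3),
# stored as (cause, outcome) -> coefficient.
direct_effects = {
    ("creation", "offensive"): 1.088,
    ("discipline", "defensive"): 2.554,
    ("offensive", "quality"): 0.920,
    ("defensive", "quality"): -0.194,
}


def indirect_effect(path):
    """Multiply the direct effects along a chain of latent variables."""
    effect = 1.0
    for cause, outcome in zip(path, path[1:]):
        effect *= direct_effects[(cause, outcome)]
    return effect


creation_on_quality = indirect_effect(["creation", "offensive", "quality"])
discipline_on_quality = indirect_effect(["discipline", "defensive", "quality"])

# Relative size of the two direct effects on quality, in absolute value.
offensive_vs_defensive = abs(
    direct_effects[("offensive", "quality")]
    / direct_effects[("defensive", "quality")]
)
```

The product for creation is 1.00096, and the product for discipline is about -0.4955, which matches the reported -0.4954 up to the last digit (the small difference may reflect rounding of the published coefficients). The ratio of the two direct effects on quality is about 4.7, consistent with the statement that the offensive effect is almost five times the defensive one.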
The negative effect of the defensive variable on quality is explained by the fact that the defensive variable is related to goals against: the more goals a team concedes, the worse its quality. The positive effect of discipline on the defensive variable can be explained by the fact that fouls generate more chances for the opposing team to shoot on goal, and thus more goals may be conceded.

Observed and latent variables correlated
In this section, we also comment on the correlations (positive or negative) among the observed and latent variables. Table 4 shows the correlations among some observed variables that were considered statistically significant for the fitted model, as well as the covariances between latent variables.

Offensive, defensive, creation and discipline
We can observe that offensive and defensive are negatively related to each other, as are creation and discipline. Both results have a natural explanation: a team that presents more creation is more disciplined because it has more possession of the ball during the match, and consequently its number of fouls and cards (yellow and red) will be smaller. In the same sense, the negative correlation between the offensive and defensive aspects can be explained by the fact that, during a match, more shots and more time on offense leave fewer opportunities to the opponent.

Fouls, yellow and red cards
For the observed variables, we note that yellow cards are positively correlated with red cards and fouls, and that fouls are also positively related to red cards, because, in general, many red cards are given after yellow cards.

Offsides, interception, goals against, clean sheet
Offsides are positively related to interceptions. In the game context, this relation is explained by the fact that some interceptions are directly linked to the offensive aspect, and that players are often not paying proper attention and do not keep up with the speed of the game. Goals against present a negative covariance with clean sheet. This leads us to observe that games with many goals are not the norm during the championship.

Possession, pass accuracy, shots conceded
Possession is positively related to shots and pass accuracy, and negatively related to shots conceded and fouls. These results are widely discussed by specialists: more possession leaves fewer opportunities for the opponent, so the team concedes fewer shots and its defense does not need to commit many fouls. On the other hand, more possession means that a team usually has better pass accuracy and thus creates more chances to shoot.

Concluding remarks
In this paper, we proposed the use of causal models with latent variables for the task of measuring the quality of football teams. This approach allowed us to measure the subjective concepts behind team quality and how it can be affected by other characteristics. In order to avoid the "lucky" effect, we performed our study using the four most important football leagues in the world (English, German, Italian and Spanish) for the last three seasons. We also discussed statistical inference for causal models through the structural equation model using the maximum likelihood ratio method, and selected the model by the CFI, TLI and RMSEA measures. The results revealed that team quality is explained by the offensive aspect around five times more than by the defensive aspect, and that the creation variable also exerts an important effect on team quality. Furthermore, the results agree well with player market strategies, in which the most valuable players (those with higher salaries and sponsorship values) are generally those with more developed offensive skills, such as midfielders, forwards and strikers. The importance of players with offensive skills is also visible in the best player of the year awards, where, in 24 editions, only once did a player who plays in the defensive half of the field receive the prize. Moreover, our findings support the use of causal models as an efficient tool to explain and quantify relationships that are usually treated as matters of opinion by many specialists.