Modeling Compositional Regression with uncorrelated and correlated errors: a Bayesian approach

Compositional data consist of vectors whose components are positive, lie in the interval (0,1) and represent proportions or fractions of a "whole"; the components must sum to one. Compositional data arise in many fields, such as geology, economics and medicine. In this paper, we introduce a Bayesian analysis for compositional regression, applying the additive log-ratio (ALR) transformation and assuming uncorrelated and correlated errors. The Bayesian inference procedure is based on Markov chain Monte Carlo (MCMC) methods. The methodology is illustrated on an artificial data set and a real volleyball data set.


Introduction
Compositional data are vectors of proportions specifying G fractions of a whole. Such data often result when raw data are normalized or when data are obtained as proportions of a certain heterogeneous quantity. By definition, a vector x in the simplex sample space is a composition, the elements of this vector are components, and a set of such vectors is compositional data [2]. Therefore, for x = (x_1, x_2, ..., x_G)' to be a compositional vector, x_i must be non-negative for i = 1, ..., G, and x_1 + x_2 + ... + x_G = 1.
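The mapping of a vector of positive parts onto the simplex can be sketched as follows (a minimal Python illustration; the helper name `closure` follows Aitchison's terminology, and the example values are hypothetical):

```python
import numpy as np

def closure(w):
    """Normalize a vector of positive parts so the components sum to one,
    yielding a composition on the simplex."""
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        raise ValueError("all parts must be strictly positive")
    return w / w.sum()

# Hypothetical point counts for one game: attack, block, serve, opponent errors
x = closure([45.0, 12.0, 5.0, 8.0])
```

Each component of `x` then lies in (0,1) and the components sum to one, as required of a composition.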
The first model adopted for the analysis of compositional data was the Dirichlet distribution.
However, it imposes a wholly negative correlation structure among the components, which is not always observed in compositional data, where some correlations are positive (see, for example, Aitchison [2]).
Aitchison and Shen [4] developed the logistic-normal class of distributions, transforming the G-component vector x into a vector y in R^(G-1) by means of the additive log-ratio (ALR) function. The use of Bayesian methods is a good alternative for the analysis of compositional data (see, for example, Iyengar and Dey [11, 12]; or Tjelmeland and Lund [19]), especially considering Markov chain Monte Carlo (MCMC) methods.
The main purpose of this paper is to develop a Bayesian approach for the compositional regression model assuming correlated and uncorrelated normal errors. Data from this type of sport (attack, block, serve and opponent error points) usually have compositional restrictions, i.e., a dependence structure, so standard methods for analyzing multivariate data under the usual assumption of a multivariate normal distribution (see, for example, Johnson et al. [13]) are not appropriate for them.
We consider a real data set related to the first- and second-round matches of the Brazilian Men's Volleyball Super League 2011/2012, obtained from the website [6]. The data concern the teams that won the games in those rounds; more specifically, the points of the team that won each game were defined as a composition, and the proportions of each composition are the volleyball skills: attack, block, serve and errors of the opposing team.
The points of the winning team in each game were decomposed into four components: we denote by x_1 the proportion of points from attacks, x_2 the proportion from blocks, x_3 the proportion from serves, and x_4 the proportion from errors of the opposing team.
The paper is organized as follows: Section 2 introduces the formulation of the regression model based on the additive log-ratio (ALR) transformation; Section 3 reports a Bayesian analysis of the proposed model assuming correlated and uncorrelated normal errors; Section 4 provides the results of the application to an artificial data set and to a real data set related to the Brazilian Men's Volleyball Super League 2011/2012; finally, Section 5 ends the paper with some final remarks.

Formulation of the Model
We can consider y_ij = H(x_ij / x_iG), i = 1, ..., n and j = 1, ..., g, where H(·) is the chosen transformation function that ensures the resulting vector has real components, and x_ij represents the i-th observation of the j-th component, such that x_i1 > 0, ..., x_iG > 0 and x_i1 + ... + x_iG = 1, for i = 1, ..., n. The ALR transformation for the analysis of compositional data is given by

y_ij = log(x_ij / x_iG),  j = 1, ..., g.  (1)

The regression model assuming the ALR transformation for the response variables is given by

y_ij = β_0j + z_i' β_1j + ε_ij,  (2)

where y_i = (y_i1, ..., y_ig) is the (1 × g) vector of response variables, with g = G − 1 and G the number of components of the compositional data; z_i is the vector of covariates associated with the i-th sample; β_0 = (β_01, ..., β_0g) is the (1 × g) vector of intercepts; β_1 is the (p × g) matrix of regression coefficients, with j-th column β_1j; and ε_ij are random errors, for j = 1, ..., G − 1 and i = 1, ..., n.
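As a concrete illustration, a minimal Python sketch of the ALR transformation and its inverse (taking the last component as the reference, as above; the composition values are hypothetical) is:

```python
import numpy as np

def alr(x):
    """Additive log-ratio transform: maps a G-part composition
    to a real vector of length G - 1, using x_G as the reference."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def alr_inv(y):
    """Inverse ALR: maps y in R^(G-1) back to the simplex."""
    e = np.exp(np.append(np.asarray(y, dtype=float), 0.0))
    return e / e.sum()

x = np.array([0.5, 0.2, 0.2, 0.1])   # a 4-part composition (G = 4)
y = alr(x)                            # real vector of length G - 1 = 3
x_back = alr_inv(y)                   # recovers the original composition
```

Because the inverse map returns exactly to the simplex, fitted values on the transformed scale can always be carried back to proportions.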

Bayesian analysis considering ALR transformation
This section presents a Bayesian analysis of model (2), with the ALR transformation (1) applied to the response variables, assuming a multivariate normal distribution for the correlated and uncorrelated errors.

Bayesian analysis considering ALR transformation assuming uncorrelated errors
The dependent variables of our study are the transformed proportions of the four components defined above: attack (x_1), block (x_2), serve (x_3) and errors of the opposing team (x_4). We considered four independent variables (covariates): whether the player who scored the most points in the game belongs to the winning team (z_1), whether the winning team has won the League at least once in the last twelve years (z_2), the percentage of excellent receptions of the winning team in the game (z_3) and the percentage of excellent defenses of the losing team in the game (z_4).
The regression model for the transformed data y_i1, y_i2 and y_i3 is given by

y_ij = β_0j + β_1j z_i1 + β_2j z_i2 + β_3j z_i3 + β_4j z_i4 + ε_ij,  (3)

where the covariates associated with the i-th game are described above; y_ij represents the transformed proportion of the j-th component (attack, block, serve) in the i-th game; β_0j represents the mean of the proportion of points in the j-th component, relative to component x_i4 (errors of the opposing team), for a team that did not win the Super League; β_1j, β_2j, β_3j, β_4j indicate possible covariate effects; and ε_i = (ε_i1, ε_i2, ε_i3) is the error vector, assumed to consist of independent random variables with normal distribution N(0, Σ_1), where 0 is a vector of zeros and Σ_1 is the diagonal variance-covariance matrix

Σ_1 = diag(σ_1^2, σ_2^2, σ_3^2).

The likelihood function of the parameters ν_1 = (β_0, β_1, β_2, β_3, β_4, σ^2) is given by

L(ν_1) ∝ ∏_{j=1}^{3} σ_j^{-n} exp{ −(1/(2σ_j^2)) Σ_{i=1}^{n} (y_ij − μ_ij)^2 },  (4)

where μ_ij = β_0j + β_1j z_i1 + β_2j z_i2 + β_3j z_i3 + β_4j z_i4, for j = 1, 2, 3 and i = 1, ..., 128.
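Under uncorrelated errors the log-likelihood factorizes over the three transformed responses; a minimal Python sketch of its evaluation (names and dimensions are illustrative) is:

```python
import numpy as np

def loglik_uncorrelated(Y, Z, B, sigma2):
    """Log-likelihood for the ALR regression with uncorrelated errors.
    Y: (n, 3) transformed responses; Z: (n, p+1) design matrix with a
    leading column of ones; B: (p+1, 3) coefficients, one column per
    response; sigma2: length-3 vector of error variances."""
    n = Y.shape[0]
    resid = Y - Z @ B
    ll = 0.0
    for j in range(Y.shape[1]):
        ll += (-0.5 * n * np.log(2.0 * np.pi * sigma2[j])
               - 0.5 * np.sum(resid[:, j] ** 2) / sigma2[j])
    return ll

# Tiny check: with zero residuals and unit variances the log-likelihood
# reduces to -0.5 * n * 3 * log(2*pi)
Z = np.column_stack([np.ones(4), np.arange(4.0)])
B = np.array([[1.0, 2.0, 3.0], [0.5, -0.5, 0.0]])
Y = Z @ B
ll = loglik_uncorrelated(Y, Z, B, np.ones(3))
```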
Bayesian inference allows previous knowledge about the parameters to be incorporated through a prior distribution. The Bayesian inference procedure for regression model (3) considers proper prior distributions, which guarantee proper posterior distributions; the fixed hyperparameters were chosen so that the priors are approximately non-informative. Thus, we assume the following prior distributions for the parameters ν_1:

β_0j ~ N(a_0j, b_0j^2),  β_lj ~ N(a_lj, b_lj^2),  σ_j^2 ~ IG(c_j, d_j),  (5)

where IG(c, d) denotes an inverse-gamma distribution with mean d/(c − 1), for c > 1, and variance d^2/((c − 1)^2 (c − 2)), for c > 2; a_0j, b_0j, a_lj, b_lj, c_j and d_j are known hyperparameters, j = 1, 2, 3 and l = 1, ..., 4.
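The inverse-gamma moments quoted above can be written down and checked by Monte Carlo, using the fact that if g ~ Gamma(shape c, rate d) then 1/g ~ IG(c, d) (a sketch; the shape and scale values are illustrative):

```python
import numpy as np

def inv_gamma_moments(c, d):
    """Mean (requires c > 1) and variance (requires c > 2) of an
    Inverse-Gamma(c, d) distribution with shape c and scale d."""
    mean = d / (c - 1)
    var = d ** 2 / ((c - 1) ** 2 * (c - 2))
    return mean, var

rng = np.random.default_rng(0)
c, d = 5.0, 8.0
# numpy's gamma is parameterized by shape and scale, so rate d -> scale 1/d
draws = 1.0 / rng.gamma(shape=c, scale=1.0 / d, size=1_000_000)
mean, var = inv_gamma_moments(c, d)   # analytic moments: (2.0, 4/3)
```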
All the parameters were assumed independent a priori.
Posterior summaries of interest for model (3), under the prior distributions (5), are obtained from simulated samples of the joint posterior distribution of ν_1, which by Bayes' formula is proportional to the product of the likelihood (4) and the priors (5). The full conditional posterior densities used by the Gibbs sampling algorithm (Gelfand and Smith [10]) are normal for the regression coefficients β_0j and β_lj, and inverse-gamma for the variances σ_j^2. For the estimation procedure, we consider joint estimation, where all the model parameters are estimated simultaneously in the MCMC algorithm. The conditional densities (6), (7) and (8) belong to known parametric density families. Posterior summaries of interest for each model are obtained using standard MCMC methods through the Just Another Gibbs Sampler (JAGS) program [15]. We used the rjags package [16] interfacing with the R software [17].
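To make the conjugate structure concrete, a self-contained Gibbs sampler for a single transformed response can be sketched in Python; this is an illustration of the update scheme, not the JAGS code actually used, and all settings and synthetic values are hypothetical:

```python
import numpy as np

def gibbs_regression(y, Z, n_iter=2000, tau2=1000.0, c0=0.1, d0=0.1, seed=1):
    """Gibbs sampler for y = Z b + e, e ~ N(0, s2), with priors
    b ~ N(0, tau2 I) and s2 ~ IG(c0, d0); both full conditionals
    have closed form (normal and inverse-gamma)."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    b, s2 = np.zeros(p), 1.0
    B, S2 = np.empty((n_iter, p)), np.empty(n_iter)
    ZtZ, Zty = Z.T @ Z, Z.T @ y
    for t in range(n_iter):
        # b | s2, y  ~  Normal(mean, cov)
        prec = ZtZ / s2 + np.eye(p) / tau2
        cov = np.linalg.inv(prec)
        mean = cov @ (Zty / s2)
        b = rng.multivariate_normal(mean, cov)
        # s2 | b, y  ~  Inverse-Gamma(c0 + n/2, d0 + SSR/2)
        resid = y - Z @ b
        s2 = 1.0 / rng.gamma(c0 + n / 2.0,
                             1.0 / (d0 + 0.5 * resid @ resid))
        B[t], S2[t] = b, s2
    return B, S2

# Synthetic check: recover known coefficients
rng = np.random.default_rng(42)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
y = Z @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, size=n)
B, S2 = gibbs_regression(y, Z)
b_hat = B[500:].mean(axis=0)      # posterior means after burn-in
```

Because both conditionals are standard families, the chain needs no Metropolis step, which is what the text above means by the conditionals belonging to known parametric density families.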

Bayesian analysis considering ALR transformation assuming correlated errors
This section considers correlated errors for the model given in (3), with a multivariate normal distribution; i.e., ε_i = (ε_i1, ε_i2, ε_i3) represents the error vector, assumed to consist of dependent random variables with multivariate normal distribution N_3(0, Σ_2), where 0 is a vector of zeros and Σ_2 is the variance-covariance matrix

Σ_2 = [ σ_1^2          ρ_12 σ_1 σ_2   ρ_13 σ_1 σ_3
        ρ_12 σ_1 σ_2   σ_2^2          ρ_23 σ_2 σ_3
        ρ_13 σ_1 σ_3   ρ_23 σ_2 σ_3   σ_3^2 ].   (9)

Under the assumptions above, the likelihood function of the parameters ν_2 = (β_0, β_1, β_2, β_3, β_4, σ^2, ρ) is given by

L(ν_2) ∝ |Σ_2|^{−n/2} exp{ −(1/2) Σ_{i=1}^{n} (y_i − μ_i)' Σ_2^{−1} (y_i − μ_i) },  (10)

where y_i = (y_i1, y_i2, y_i3) and μ_i = (μ_i1, μ_i2, μ_i3), with μ_ij as in (4). For the Bayesian analysis, we assume the same prior distributions (5) for β_0j, β_lj and σ_j, j = 1, 2, 3 and l = 1, ..., 4. Uniform priors were considered for ρ = (ρ_12, ρ_13, ρ_23), that is, ρ_jk ~ U(−1, 1), subject to Σ_2 being positive definite. All the parameters were assumed independent a priori.
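The matrix Σ_2 can be assembled from the standard deviations and correlations; note that independent draws of the ρ_jk on (−1, 1) do not automatically yield a valid covariance matrix, so a positive-definiteness check is useful (a Python sketch with illustrative values):

```python
import numpy as np

def build_sigma2(sd, rho):
    """Assemble the 3x3 matrix Sigma_2 from standard deviations
    (s1, s2, s3) and correlations (rho12, rho13, rho23)."""
    s1, s2, s3 = sd
    r12, r13, r23 = rho
    R = np.array([[1.0, r12, r13],
                  [r12, 1.0, r23],
                  [r13, r23, 1.0]])
    D = np.diag([s1, s2, s3])
    return D @ R @ D

def is_positive_definite(S):
    """A symmetric matrix is a valid covariance matrix iff all its
    eigenvalues are positive."""
    return bool(np.all(np.linalg.eigvalsh(S) > 0.0))

S_ok = build_sigma2((1.0, 2.0, 3.0), (0.5, 0.2, 0.1))
S_bad = build_sigma2((1.0, 1.0, 1.0), (0.9, 0.9, -0.9))  # invalid combination
```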
Posterior summaries of interest for the model defined by (3), but with correlated errors, under the prior distributions (5), are obtained from simulated samples of the joint posterior distribution of ν_2, which by Bayes' formula is proportional to the product of the likelihood (10) and the priors. The full conditional posterior densities for each parameter follow from this joint posterior, for i = 1, ..., n, j = 1, 2, 3 and l = 1, 2, 3, 4.
For the estimation procedure, we consider joint estimation, where all the model parameters are estimated simultaneously in the MCMC algorithm. The conditional densities (11), (12), (13) and (14) do not belong to any known parametric density family. Posterior summaries of interest for each model are obtained using standard MCMC methods through the Just Another Gibbs Sampler (JAGS) program [15]. We used the rjags package [16] interfacing with the R software [17].

Application
This section reports a simulation study for the compositional data and illustrates an application of the proposed methodology through ALR transformation based on data related to proportions of the points of volleyball teams.
We considered one dichotomized covariate z_i1 (the player who scored the most points in the i-th game belongs to the winning team), generated as z_1 ~ Bernoulli(0.8), and one continuous covariate z_i2 (percentage of excellent receptions of the winning team in the game), generated as z_2 ~ Normal(0.5, 0.1).
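One artificial replicate of the covariates (and, with hypothetical coefficients, of one transformed response) can be generated as follows; the coefficient values, the seed, and the reading of 0.1 as a standard deviation are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2012)
n = 128
z1 = rng.binomial(1, 0.8, size=n)    # dichotomous covariate, P(z1 = 1) = 0.8
z2 = rng.normal(0.5, 0.1, size=n)    # continuous covariate (sd = 0.1 assumed)

# Hypothetical coefficients for one transformed response y_j
b0, b1, b2, sigma = 0.3, 0.5, -0.2, 0.4
y = b0 + b1 * z1 + b2 * z2 + rng.normal(0.0, sigma, size=n)
```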
The simulation study was based on 1000 samples generated for each case mentioned above. We used the rjags package [16] interfacing with the R software [17].
Table 1 shows the simulation results, i.e., mean, standard deviation (SD) and coverage probability (CP). The CP was stable and close to the nominal coverage.
Table 2 shows the Bayesian criteria for the models assuming uncorrelated and correlated errors. The model assuming correlated errors is better than the other model under all considered criteria.

Real data application
In this section, we consider a Bayesian analysis of the real data set presented in the Appendix (Table A) to illustrate an application of the proposed methodology, in particular, data related to the proportions of points of volleyball teams. We apply the compositional data methodology to this set, considering as composition proportions the points of the winning team in 128 games of the Brazilian Men's Volleyball Super League 2011/2012. This study was based on four components: attack (x_i1), block (x_i2), serve (x_i3) and errors of the opposing team (x_i4), for i = 1, ..., 128. The proposed model (3) and the following independent proper prior distributions (5) were considered: β_0j ~ N(0, 1000), β_lj ~ N(0, 1000) and σ_j^2 ~ IG(0.1, 100), where l = 1, 2, 3, 4 and j = 1, 2, 3. For the proposed regression model with correlated errors (9), we considered the same independent proper prior distributions for β_0j, β_lj and σ_j^2, for l = 1, 2, 3, 4 and j = 1, 2, 3. We simulated 100,000 Gibbs samples using the rjags package [16] interfacing with the R software [17]; the first 10,000 simulated samples were discarded as burn-in to eliminate the effect of the initial values, and we kept every 20th sample among the remaining 90,000 Gibbs samples. Convergence was verified through the Gelman-Rubin diagnostic, which showed values very close to 1, indicating convergence of the simulation algorithm.
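The number of retained posterior draws follows directly from these burn-in and thinning settings:

```python
total, burn_in, thin = 100_000, 10_000, 20
kept = (total - burn_in) // thin   # number of posterior draws retained
```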
According to Carlin and Louis [8], the most basic tool for investigating model uncertainty is sensitivity analysis, that is, making reasonable modifications to the assumptions, recomputing the posterior quantities of interest and seeing whether they change in a way that has a practical impact on the interpretations. Thus, we performed a sensitivity analysis for different choices of the prior hyperparameters (of β_0j, β_lj and σ_j^2, for l = 1, 2, 3, 4), changing one at a time while keeping all others at their default values. We observed that the posterior summaries of the parameters do not present considerable differences, so these choices do not affect the results.
Table 3 shows the posterior summaries for the parameters of model (3), assuming uncorrelated and correlated errors, based on the 4,500 final simulated Gibbs samples.
Note that there are significant effects on the proportions of attack, block and serve points, as indicated by the estimates of β_11, β_31, β_42 and β_43 for both models (uncorrelated and correlated errors); i.e., the player who scored the most points belonging to the winning team, the percentage of excellent receptions of the winning team and the percentage of excellent defenses of the losing team help in these skills. Moreover, the estimated posterior means and standard deviations are similar for both models. We also observe that more parameters were significant in the correlated model than in the uncorrelated model, namely β_12, β_13, β_21, β_22, β_32 and β_41.

Table 1 :
Simulation data. Summary of the posterior distributions for the model parameters assuming uncorrelated and correlated errors.

Table 4
presents the Bayesian model selection criteria: the expected Akaike information criterion (EAIC), the expected Bayesian information criterion (EBIC), the deviance information criterion (DIC) and a summary statistic of the CPO_i's, the log pseudo-marginal likelihood (LPML = Σ_{i=1}^{n} log(CPO_i)). These results suggest that the fitted regression model assuming correlated errors is the best choice (lower EAIC, EBIC and DIC values).
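Given the MCMC deviance output, the first three criteria can be computed as follows (a sketch; the toy deviance values are hypothetical). Here p is the number of parameters, n the sample size, and pD = D̄ − D(θ̂) the effective number of parameters used by the DIC:

```python
import numpy as np

def model_criteria(deviances, dev_at_post_mean, p, n):
    """EAIC, EBIC and DIC from sampled deviances D(theta^(t)).
    Lower values indicate a better-fitting model."""
    d_bar = float(np.mean(deviances))          # posterior mean deviance
    p_d = d_bar - dev_at_post_mean             # effective number of parameters
    dic = d_bar + p_d
    eaic = d_bar + 2 * p
    ebic = d_bar + p * np.log(n)
    return eaic, ebic, dic

# Toy example with hypothetical deviance draws
eaic, ebic, dic = model_criteria([100.0, 102.0, 98.0, 100.0],
                                 dev_at_post_mean=96.0, p=5, n=128)
```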

Table 3 :
Summary of the posterior distributions for the model parameters assuming uncorrelated and correlated errors.

Table 4 :
Bayesian criteria for the models assuming uncorrelated and correlated errors.