REGRESSION FOR COMPOSITIONAL DATA WITH COMPOSITIONAL DATA AS PREDICTOR VARIABLES WITH OR WITHOUT ZERO VALUES

Compositional data are positive multivariate data constrained to lie within the simplex. Regression analysis of such data has been studied extensively and many regression models have been proposed, but most of them do not allow for zero values. Furthermore, the case of compositional data appearing on the predictor variable side has attracted little research interest and, surprisingly, the case of both the response and the predictor variables being compositional has not been widely studied. This paper suggests a solution for this last problem. Principal components regression using the α-transformation and the Kullback-Leibler divergence are the key elements of the proposed approach. An advantage of this approach is that zero values are allowed on both the response and the predictor side. Simulation studies and examples with real data illustrate the performance of our algorithm.


Introduction
Compositional data are positive multivariate data whose vector elements sum to the same constant, usually taken to be 1 for convenience. Data of this type arise in many fields, such as geology, ecology, archaeometry, economics, geochemistry, biology, political sciences and forensic sciences, showing their wide applicability. Their support, termed the simplex, is given by

S^{D-1} = \{(x_1, \ldots, x_D)^T : x_i \geq 0, \ \sum_{i=1}^{D} x_i = 1\},   (1)

where D denotes the number of variables (better known as components).
There are various techniques for regression analysis with compositional data as the response variables. See for example Aitchison (2003), who used classical methods on the log-ratio transformed space, and Gueorguieva et al. (2008), who applied Dirichlet regression.
Stephens (1982) and Scealy and Welsh (2011) transformed the data onto the surface of the unit hyper-sphere, using the square root transformation, and thus treated them as directional data.
An important issue in compositional data is the presence of zeros, which cause problems for the logarithmic transformation. The issue of zero values in some components is not addressed in most papers, especially in the regression setting. When zero values exist in the data, Dirichlet models and the log-ratio transformations suggested by Aitchison (1982, 2003) and Egozcue et al. (2003) will not work unless a zero value imputation is applied first. The square root transformation, on the other hand, the α-regression and divergence based regression models treat zero values naturally. More recently, Tsagris and Stewart (2018) proposed a Dirichlet regression modified to account for zero values. As for the classification setting, Tsagris (2014) proposed the use of a power transformation applicable to cases with zero values in the data.
Most papers focus on compositional data being on the response variable side. The case of compositional data on the predictor variable side was first treated by Hron et al. (2012), who applied the isometric log-ratio transformation, defined in (4), to the compositional data and then fitted a standard linear regression model. The case of both the dependent and the independent variables containing compositional data was treated by Wang et al. (2015) in the context of variable selection. Wang et al. (2013) suggested the use of the isometric log-ratio transformation (4) on both sides. The work closest to ours is Wang et al. (2010), who transformed the compositional data using (4) and then applied partial least squares.
The isometric log-ratio transformation (4), or the more general α-transformation (6), reduces the dimensionality of the compositional data by 1 via the sub-Helmert matrix (the Helmert matrix (Lancaster, 1965) without the first row). Collinearities in the data may still exist, and this is why Tsagris (2015b) suggested the use of principal components regression (Jolliffe, 2005). A second advantage of the latter approach is that, unlike the isometric log-ratio transformation (4), the α-transformation (6) is applicable when zero values are present and no zero value imputation is necessary. This is very important, since in large datasets with many zero values (not necessarily sparse data) imputation is not a recommended strategy. In addition, Tsagris (2015b) showed that α-PCR can lead to better predictions with a value of α other than zero.
We propose a solution combining some of the aforementioned papers. Specifically, we employ the α-principal components regression (α-PCR) for the independent compositional data and the multinomial logit regression (MLR) model (Murteira and Ramalho, 2016) for the response compositional data. We not only substitute the partial least squares of Wang et al. (2010) with PCA, but also generalize both methods, since a transformation more general than the logarithmic one is applied. In addition, zero values are treated naturally, without any modification of the data or conditional distributions.
In the next section we present some preliminaries regarding the statistical analysis of compositional data, the MLR model (Murteira and Ramalho, 2016) and the α-PCR (Tsagris, 2015b). The new approach is presented in Section 3, simulation studies are presented in Section 4 and a demonstration with real data is given in Section 5. Finally, the conclusions close the paper.

Preliminaries
Below we summarize some important information regarding compositional data that will be used throughout this paper. We start by listing some relevant transformations, continuing with the α-PCR and the multinomial logit regression.

Additive log-ratio transformation
For a composition x ∈ S^{D-1}, the additive log-ratio transformation is defined in Aitchison (1982) as

y_j = \log\left(\frac{x_j}{x_D}\right), \quad j = 1, \ldots, D-1,   (2)

where x_D is the last component, playing the role of the common divisor. This choice is not restrictive, since any component can play this role.

Centered log-ratio transformation
The drawback of (2) is that it treats the components asymmetrically, and the interpretation of the results in many cases depends upon the common divisor. For this reason Aitchison (1983) suggested the centered log-ratio transformation

w_i = \log\left(\frac{x_i}{g(\mathbf{x})}\right), \quad i = 1, \ldots, D,   (3)

where g(\mathbf{x}) = \left(\prod_{j=1}^{D} x_j\right)^{1/D} is the geometric mean of the components. Note that the zero sum constraint \sum_{i=1}^{D} w_i = 0 implied by Equation (3) is an obvious drawback of this transformation, as it can lead to singularity issues.
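The two transformations above can be sketched numerically as follows. This is a minimal illustration, not code from any compositional data package; the function names `alr` and `clr` are our own.

```python
import numpy as np

def alr(x):
    """Additive log-ratio (Eq. 2): log of each component over the last one."""
    x = np.asarray(x, dtype=float)
    return np.log(x[..., :-1] / x[..., -1:])

def clr(x):
    """Centred log-ratio (Eq. 3): log of each component over the geometric mean."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.log(x).mean(axis=-1, keepdims=True))
    return np.log(x / g)

x = np.array([0.2, 0.3, 0.5])
print(alr(x))          # two coordinates in R^{D-1}
print(clr(x).sum())    # the zero-sum constraint: sums to 0
```

The zero sum of the clr coordinates is exactly the singularity-inducing constraint mentioned above.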

Isometric log-ratio transformation
In order to remove the redundant dimension imposed by this constraint, one can apply the so-called isometric log-ratio transformation (Egozcue et al., 2003)

\mathbf{z} = \mathbf{H}\mathbf{w},   (4)

where w is the centered log-ratio transformed vector (3) and H is the Helmert matrix (Lancaster, 1965) (an orthonormal D × D matrix) after deletion of the first row. This is a standard orthogonal matrix in shape analysis, used to overcome singularity problems (Dryden & Mardia, 1998; Le & Small, 1999). By left-multiplying the data with the Helmert matrix (without the first row), the data are mapped onto ℝ^{D-1} and the zero sum constraint is removed.
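A sketch of the sub-Helmert matrix used in (4); `helmert_sub` is our own helper (equivalent, up to row signs, to what `scipy.linalg.helmert` produces) and the checks below confirm the two properties the text relies on: orthonormal rows and zero row sums.

```python
import numpy as np

def helmert_sub(D):
    """The (D-1) x D Helmert matrix after deleting the first (constant) row."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

H = helmert_sub(4)
print(np.allclose(H @ H.T, np.eye(3)))   # rows are orthonormal
print(np.allclose(H.sum(axis=1), 0.0))   # each row sums to zero
```

Because each row sums to zero, left-multiplying a clr vector by H discards exactly the redundant zero-sum direction, mapping the data onto R^{D-1}.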

𝜶-transformation
Tsagris, Preston & Wood (2011) developed the α-transformation as a more general transformation than that in Equation (4). Let

\mathbf{u} = \left(\frac{x_1^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}, \ldots, \frac{x_D^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}\right)   (5)

denote the power transformation for compositional data as defined by Aitchison (2003). In a manner analogous to Equations (3)-(4), they defined the α-transformation to be

\mathbf{z}_{\alpha} = \mathbf{H}\frac{D\mathbf{u} - \mathbf{1}_D}{\alpha},   (6)

where 1_D is the D-dimensional vector of ones. The α-transformation (6) converges to the isometric log-ratio transformation (4) as α tends to zero (Tsagris et al., 2011).
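A numerical sketch of (5) and (6), with our own helper names, verifying the stated convergence to the ilr as α tends to zero:

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D Helmert matrix without its first (constant) row."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def power(x, a):
    """Eq. (5): the closed power transformation."""
    xa = np.asarray(x, dtype=float) ** a
    return xa / xa.sum(axis=-1, keepdims=True)

def alpha_transform(x, a):
    """Eq. (6): maps the simplex to R^{D-1}; defined for zeros when a > 0."""
    D = np.shape(x)[-1]
    return (D * power(x, a) - 1.0) / a @ helmert_sub(D).T

# convergence to the ilr (Eq. 4) as alpha -> 0
x = np.array([0.2, 0.3, 0.5])
g = np.exp(np.log(x).mean())
ilr = helmert_sub(3) @ np.log(x / g)
print(np.allclose(alpha_transform(x, 1e-7), ilr, atol=1e-5))
```

Note that for α > 0 the power transformation (5) is well defined even when some x_i = 0, which is the property exploited later for zero values.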

𝜶-Principal components regression
Hron et al. (2012) studied the case of compositional data being predictor variables by using (4). Tsagris (2015b) has already covered this case, and we are interested in generalizing this idea to cover the case of multicollinearity as well.
Principal components regression (PCR) is based on principal component analysis (Jolliffe, 2005), and hence we briefly describe the algorithm for PCR using the α-transformation (Tsagris, 2015b):

1. Choose a value of α, apply the α-transformation (6) to the compositional data (independent variables) X and obtain the transformed data Z_α.

2. Perform eigen-analysis on Z_α^T Z_α and calculate the matrix of eigenvectors V and the scores SC = Z_α V.

3. Perform regression analysis using the scores SC as predictor variables.
We will term the above procedure α-PCR. The value of α and the number of principal components that lead to the optimal results are chosen via cross-validation (CV), described in detail in Section 3.
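The three steps above can be sketched end-to-end as follows. This is a minimal illustration under our own helper names, not the authors' implementation; with all D−1 PCs retained the fit coincides with the full regression, as noted below.

```python
import numpy as np

def helmert_sub(D):
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        H[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return H

def alpha_transform(X, a):
    """Eq. (6) applied row-wise to a matrix of compositions."""
    D = X.shape[1]
    U = X ** a / (X ** a).sum(axis=1, keepdims=True)
    return (D * U - 1.0) / a @ helmert_sub(D).T

def alpha_pcr(X, y, a, k):
    """Steps 1-3: alpha-transform, eigen-analysis of Z'Z, regress on scores."""
    Z = alpha_transform(X, a)
    Zc = Z - Z.mean(axis=0)                      # centre before PCA
    evals, evecs = np.linalg.eigh(Zc.T @ Zc)
    V = evecs[:, np.argsort(evals)[::-1][:k]]    # top-k eigenvectors
    SC = Zc @ V                                  # principal component scores
    A = np.column_stack([np.ones(len(y)), SC])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta                              # fitted values

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(4), size=50)           # simulated compositions
y = alpha_transform(X, 0.5) @ np.array([1.0, 2.0, -1.0]) + 0.5
print(np.allclose(alpha_pcr(X, y, 0.5, 3), y))   # all D-1 PCs: exact fit
```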
We can use any number of eigenvectors (or principal components). If we use all of them, we end up with the same fitted values as if we had implemented a regression model with all the components of the independent composition. However, our focus is on using a subset of them in order to reduce noise. We do not perform feature selection, nor do we perform any statistical inference regarding the coefficients of the principal components. From the perspective of an applied scientist or a researcher from another field (e.g. an ecologist), statistical inference is important: they would be interested in the significance of the predictor variables. However, in the way we have formulated our approach, assessing the significance of the predictor variables is not feasible.
In addition, we are more interested in estimating the dependent compositions as accurately as possible; that is, we focus on the predictive performance of the model.

Multinomial logit regression
When compositional data are on the response variable side (the usual and most studied case), Murteira and Ramalho (2016) mentioned the use of the Kullback-Leibler divergence of the observed from the fitted compositional vectors

KL = \sum_{i=1}^{n} \sum_{j=1}^{D} y_{ij} \log\frac{y_{ij}}{\mu_j(\mathbf{x}_i)},   (7)

where \mu_j(\mathbf{x}) = \frac{e^{\mathbf{x}^T \boldsymbol{\beta}_j}}{\sum_{k=1}^{D} e^{\mathbf{x}^T \boldsymbol{\beta}_k}} (with \boldsymbol{\beta}_1 = \mathbf{0}) are the fitted values of the multinomial logit regression, and y and x denote the compositional response variables and the set of predictor variables respectively.
A closed form solution for the minimization of (7) does not exist, but the use of the Newton-Raphson algorithm (Böhning, 1992) can speed up the minimization process. The advantage of using (7) instead of the classical multivariate regression after the additive (2) or the isometric (4) log-ratio transformation is that zeros are treated naturally and require no further changes or modifications.
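A sketch of fitting the MLR by numerically minimising (7); we use a generic quasi-Newton optimiser (`scipy.optimize.minimize` with BFGS) rather than the Newton-Raphson of Böhning (1992), fix the first component's coefficients at zero for identifiability, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def mlr_fit(X, Y):
    """Multinomial logit regression fitted by minimising (7)."""
    n, p = X.shape
    D = Y.shape[1]
    A = np.column_stack([np.ones(n), X])            # design with intercept

    def mu(b):
        """Fitted compositions: softmax with beta_1 fixed at zero."""
        B = np.column_stack([np.zeros(p + 1), b.reshape(p + 1, D - 1)])
        E = np.exp(A @ B - (A @ B).max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    def loss(b):
        # minimising (7) over the coefficients is equivalent to minimising
        # the cross-entropy -sum(y * log(mu)); terms with y_ij = 0 vanish
        return -np.sum(Y * np.log(mu(b) + 1e-12))

    res = minimize(loss, np.zeros((p + 1) * (D - 1)), method="BFGS")
    return mu(res.x)                                # fitted compositions

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
Y = rng.dirichlet(np.ones(3), size=40)
M = mlr_fit(X, Y)
print(np.allclose(M.sum(axis=1), 1.0))              # valid compositions
```

Because the loss multiplies each log term by y_ij, observed zeros simply drop out of (7), which is precisely the zero-handling advantage described above.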

Compositional-compositional regression
Compositional-compositional regression refers to the case of both the response and the predictor variables consisting of compositional data. We will refer to the response or dependent variables as the response or dependent composition, and to the independent variables as the independent or predictor composition. We have defined all the prerequisites necessary to construct our proposed methodology; all we need to do is couple the α-PCR with the MLR.
Hence, we will substitute x in (7) with SC = Z_α V, the scores of the PCA after applying the α-transformation to the independent composition, and minimize (7) with respect to the regression coefficients of the scores.
Similarly to Tsagris (2015b), we will use the K-fold CV protocol to choose the optimal value of α and the optimal number of principal components. According to the K-fold CV protocol, we split the data into K distinct sets and each time remove one fold, using the remaining K−1 sets to build the compositional-compositional regression. The fold left outside the model construction is used as a test set to validate the model. The performance, or predictability, of the model is measured by the Kullback-Leibler divergence of the true from the fitted compositional vectors. The optimal results, i.e. the chosen value of α and number of principal components (PCs), refer to the minimization of the mean Kullback-Leibler divergence of the observed from the predicted compositional data.
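The CV protocol above can be sketched as a grid search. In this illustrative skeleton `fit_predict` is only a stand-in for the full α-PCR + MLR fit (here it just predicts the training mean composition, to keep the sketch runnable); all names are our own.

```python
import numpy as np

def kl_div(Y, M):
    """Mean KL divergence of observed from fitted; y_ij = 0 terms vanish."""
    with np.errstate(divide="ignore", invalid="ignore"):
        T = np.where(Y > 0, Y * np.log(Y / M), 0.0)
    return T.sum(axis=1).mean()

def cv_grid(X, Y, alphas, ks, fit_predict, K=10, seed=1):
    """K-fold CV over the (alpha, number of PCs) grid."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)
    best = (np.inf, None, None)
    for a in alphas:
        for k in ks:
            kls = []
            for f in folds:
                tr = np.setdiff1d(idx, f)            # K-1 training folds
                M = fit_predict(X[tr], Y[tr], X[f], a, k)
                kls.append(kl_div(Y[f], M))          # held-out fold KL
            if np.mean(kls) < best[0]:
                best = (np.mean(kls), a, k)
    return best   # (minimum mean KL, optimal alpha, optimal number of PCs)

def fit_predict(Xtr, Ytr, Xte, a, k):               # placeholder model
    return np.tile(Ytr.mean(axis=0), (len(Xte), 1))

rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(5), size=60)
Y = rng.dirichlet(np.ones(4), size=60)
kl, a, k = cv_grid(X, Y, np.linspace(0.1, 1, 10), range(1, 5), fit_predict)
print(kl, a, k)
```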
The response and the predictor compositions need not be of the same dimension, and the use of the MLR is by no means restrictive. One could also use the usual multivariate linear regression model (Mardia et al., 1979) on the log-ratio transformed data, Dirichlet regression (Gueorguieva et al., 2008) and its version adjusted for zero values (Tsagris and Stewart, 2018), ordinary least squares (Murteira and Ramalho, 2016), the ES-OV regression (Tsagris, 2015a) or any other regression model for compositional response variables. We propose the MLR model, though, not only because it handles zero values naturally, but also because it is very fast to fit.

Simulation studies
We conducted simulation studies to assess the performance of the proposed algorithm. We examined our algorithm's capability of recovering the true value of α and the true number of PCs. We used the open source software R (R Core Team, 2017) for all our simulation studies and data analyses.
We used the fgl dataset available in the MASS library in R (Ripley, 2002). The data concern measurements of forensic glass fragments and consist of 214 observations on 8 chemical elements. The dataset contains a huge number of zeros and is excellently suited for our purpose. We chose a value of α and applied the α-transformation (6). On the transformed data we calculated the 7 PCs. We took the scores of the first PC and multiplied them by some generated beta coefficients. We then generated data from a multivariate normal distribution and added some white noise. The data were then mapped onto the simplex using the inverse of the additive log-ratio transformation (2). Keeping the value of α constant, we repeated this procedure using all 7 PCs, one by one. We then chose another value of α and repeated the whole process.
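The generating mechanism can be sketched as follows; `SC1` below stands in for the first-PC scores of the α-transformed fgl data (we simply draw it at random here), and `alr_inv` is our own name for the inverse of (2).

```python
import numpy as np

def alr_inv(Y):
    """Inverse additive log-ratio: append a zero column, exponentiate, close."""
    E = np.exp(np.column_stack([Y, np.zeros(len(Y))]))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n, D = 214, 8
SC1 = rng.normal(size=(n, 1))            # stand-in for the first-PC scores
B = rng.normal(size=(1, D - 1))          # generated beta coefficients
noise = 0.1 * rng.normal(size=(n, D - 1))  # white noise
Y = alr_inv(SC1 @ B + noise)             # dependent compositions
print(np.allclose(Y.sum(axis=1), 1.0) and (Y > 0).all())
```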
In all cases only positive values of α were used; specifically, we used 10 equally spaced values ranging from 0.1 up to 1. The above process was repeated 200 times for each combination of α and number of PCs. For each combination we computed the difference between the true and the estimated value of α, and the percentage of times the algorithm chose the correct number of PCs. Tables 1 and 2 summarize our findings and Figure 1 illustrates them.
Table 1 contains the average bias of α when the dependent composition was generated from one, two or three PCs. The estimated bias of α is always small and is not really affected by the true value of α. When 4 or more PCs were used, the bias increases.
Table 2 contains the proportion of correct identifications of the number of PCs. Whether our methodology selected the true number of PCs does not depend so much on the true value of α, but rather on the number of PCs. By examining the table in a column-wise manner, we see that as we move from left to right, the percentage of correct identification of the true number of PCs decreases.

Example datasets
We will now introduce four datasets, taken from Aitchison (2003), to illustrate the performance of our proposed approach described in Section 3. The sample sizes (see Table 3) are rather small; hence, we will not perform a 10-fold CV but a leave-one-out CV (LOOCV) protocol, and we will report the optimal pairs of parameters (values of α and numbers of PCs) found.
• Clam ecology. From the many colonies of clams in East Bay, 20 colonies were selected at random and from each a sample of clams was taken. For each colony the proportions of clams in each colour-size combination were estimated and the corresponding compositions, consisting of 6 components, were recorded. A similar study conducted in West Bay yielded another 20 colour-size compositions. The task of interest is to quantify the relationship between the two compositions.
• Hair and eye colours. For each of the 33 counties of Scotland, the percentages of boys in five hair-colour categories and four eye-colour categories are available. The question of interest is to predict the composition of hair colours from the eye colour composition and vice versa.
• White-cells. A cytologist is interested in the possibility of introducing into his laboratory a new method of determining the white-cell composition of a blood sample, that is, the proportions of the three kinds of white cells among the total number of white cells observed. The current method involves time-consuming microscopic inspection and is known to be accurate, whereas the proposed method is a quick automatic image analysis whose accuracy is still largely undetermined. In an experiment to assess the effectiveness of the proposed method, each of 30 blood samples was halved, with one half assigned randomly to one method and the other half to the other. It is fairly obvious that the two methods produce different compositions, and we are interested in predicting the microscopic inspection compositions from the compositions produced by the new method.
• Fruit evaluation. The yatquat tree produces each season a single large fruit whose quality is assessed in terms of the relative proportions by volume of flesh, skin and stone. In an experiment to investigate whether a certain hormone influences quality, an agricultural scientist uses 40 yatquat trees, randomly allocates 20 trees to the hormone treatment and leaves the remaining 20 trees untreated. The fruit compositions of the 40 trees in the present and the preceding season are available. The task is to examine the relationship between the compositions in the two seasons.
In the Clam ecology and Hair and eye colours datasets there are no designated dependent and independent variables; hence, we will perform the regression both ways. In the White-cells dataset, the response composition is the microscopic inspection and the independent composition is the new method of image analysis. For the Fruit evaluation dataset, the current season is the response and the past season plays the role of the predictor composition. As summarized in Table 3, the Hair and eye colours data comprise 33 observations with 5 and 4 components, the White-cells data 30 observations with 3 and 3 components, and the Fruit evaluation data 40 observations with 3 and 3 components.

Results
Figure 2 contains the heatmap plots of the LOOCV using the first two datasets (Clam ecology and Hair and eye colours), whereas Figure 3 contains the heatmap plots of the LOOCV using the other two datasets (White-cells and Fruit evaluation). Table 4 contains the optimal values of α and the number of PCs, along with the corresponding minimum Kullback-Leibler divergence. We see that in most cases a value of α other than zero is optimal for transforming the predictor compositions.
For the clam ecology dataset, only one PC is necessary to reach the optimal predictive performance in both directions, although the chosen value of α is not the same; for the hair and eye colour dataset the combination of α and number of PCs is similar in both directions. In these two datasets there was no distinction between response and predictor composition, but for the white-cells and fruit evaluation datasets there is. For the white-cells dataset, α = 0.2 and 2 PCs lead to the optimal predictive performance, while for the fruit evaluation dataset the optimal transformation was the α-transformation with α = 1 using 1 PC. This means that the data were not log-transformed, but centered and multiplied by a constant prior to the Helmert multiplication.
Let us now make an observation for the Clam-ecology II example dataset, where the optimal value of α was −1. We remind the reader that, if zero values are present in the independent composition, only positive values of α are allowed.

Conclusions
In this work we suggested a methodology for the case of both the response and the predictor variables consisting of compositional data. The approach is based upon combining prior work on regression with compositional data. Our simulation studies and examples illustrated the proposed methodology, showing that it works satisfactorily. We should also highlight that instead of PCA or partial least squares (Wang et al., 2010), other methods could be used as well. In any case, the α-transformation (6) should be applied to the independent composition, as it gives more flexibility than the isometric log-ratio transformation (4) and allows for zero values.
The multinomial logit regression is not the only available regression model. In the case of no zero values, the usual log-ratio methodology or any other regression model could be used as well. In addition, principal component analysis is not the only dimensionality reduction technique: principal coordinate analysis or, more generally, kernel PCA (Schölkopf et al., 1997) could be used to capture non-linear dependencies among the compositional data.
Our simulation studies indicated that the bias in the estimated α increases as the true number of PCs required increases. A similar pattern was observed when selecting the number of PCs: the higher the true number of PCs, the lower the probability of selecting their exact number.
The examples with real data analysis illustrated the performance of our proposed methodology.
Our future research is oriented towards exploring the performance of our proposed methodology in higher dimensions, for either the response or the predictor composition. In addition, the robustness to outliers and the case of zero values are to be investigated as well. Outliers in the predictor composition can be addressed via robust PCA.

Figure 2: Heatmap plots of the Clam ecology and Hair and eye colours datasets. The horizontal axis contains the α values and the vertical axis the number of PCs used. The values of the Kullback-Leibler divergence are plotted, with low values being desirable.

Figure 3: Heatmap plots of the White-cells and Fruit evaluation datasets. The horizontal axis contains the α values and the vertical axis the number of PCs used. The values of the Kullback-Leibler divergence are plotted, with low values being desirable.

Table 2: Proportion of times the correct number of PCs was selected by the 10-fold CV, for a range of values of α and numbers of PCs.

Table 3: Information about the example pairs of datasets used. The sample size and the number of components of each composition are given.

Table 4: Results for each dataset. The optimal value of α and number of PCs, along with the minimum average Kullback-Leibler divergence.