Allometric Extension for Multivariate Regression

In multivariate regression, interest lies on how the response vector depends on a set of covariates. A multivariate regression model is proposed where the covariates explain variation in the response only in the direction of the first principal component axis. This model is not only parsimonious, but it provides an easy interpretation in allometric growth studies where the first principal component of the log-transformed data corresponds to constants of allometric growth. The proposed model naturally generalizes the two–group allometric extension model to the situation where groups differ according to a set of covariates. A bootstrap test for the model is proposed and a study on plant growth in the Florida Everglades is used to illustrate the model.


Introduction
In multivariate regression, interesting and interpretable results are possible if the model can incorporate the dependencies between the response variables as well as the dependencies of the response variables on the predictor variables.Full rank multivariate regression models are often used in practice where each response variable is fitted separately (e.g., Johnson and Wichern, 1998, p.420).If the most of the variability in the p-dimensional fitted values from a multivariate regression lie in a lower dimensional space, then reduced-rank regression models provide a more parsimonious modeling of the data (e.g., Anderson, 1951Anderson, , 1999;;Reinsel and Velu, 1998) where the fitted values are constrained to lie in a lower dimensional space.In this paper, we consider a rank one multivariate regression model where the line containing the fitted values coincides with the first principal component axis of the response variables.
This work was motivated by modeling the dependency of allometric growth of plants in the Florida Everglades on soil characteristics.Allometric growth models are used to quantify the relationship between the size and shape of organisms (e.g., Hills 1982;Jolicoeur, 1963;Klingenberg, 1996;Klingenberg and Froese, 1991;Mosimann, 1970).The allometric model stipulates that parts of the plant grow at constant relative rates.Rates of allometric growth, as suggested by (Jolicoeur, 1963), are estimated using the eigenvector of the covariance matrix of the log-transformed measurements associated with the largest eigenvalue.The first eigenvector is then regarded as the allometric direction for modeling rates of growth of the different parts of the plant.
In studies of two related groups (e.g.males and females) an allometric extension model (Hills, 1982;Bartoletti, Flury and Nel, 1999) holds when one group is an extension of the other group along a common allometric axis.In Section 2 we show that the multivariate regression model with fitted values constrained to lie along the first principal component axis of the response distribution provides a natural generalization to the two-group allometric extension model to the situation where groups differ according to a set of covariates.In Section 3 we discuss the estimation of the allometric extension model for regression.A bootstrap test for the allometric extension model in regression is proposed in Section 4 and the test is applied to the plant data in Section 5.The paper is concluded in Section 6.

Allometric Extension for Multivariate Regression
In growth studies of two groups (e.g.males and females), one group may be an extension of the other group along a common allometric axis.The twogroup allometric extension model states that the two groups share a common first principal component axis and the two groups are lined up along this common axis (Hills, 1982).The allometric extension model generalizes easily to more than two groups (e.g.closely related species).Schnute (1984) defines a closely related model based on a mixture model for bivariate data yielding a trend line which is similar to the line of allometric direction.
Let µ 1 and µ 2 denote the mean vectors for the two groups with corresponding covariance matrices Ψ 1 and Ψ 2 .Let β (1) 1 and β (2) 2 denote the normalized eigenvectors of each respective covariance matrix associated with the largest eigenvalue.Formally, the allometric extension model can be stated as follows (Flury, 1997, Section 8.7): The following notation allows the generalization of the two-group allometric extension model to the multivariate regression setting: let X denote a q-variate random vector of regressors and let Y denote a p-variate response vector.The multivariate regression analogue of allometric extension will be defined in terms of the conditional distribution of Y given X.It will be assumed throughout this paper that the same set of regressor variables are used to model each response variable.Form the (p + q) × 1 random vector (Y , X ) with mean µ = (µ y , µ x ) and partitioned covariance matrix We will assume that the conditional expectation of Y given X is linear as is the case for the multivariate normal distribution and, more generally, elliptical distributions.The multivariate linear regression model can be expressed as: To generalize (2.2) to the multivariate regression setting, we require that the conditional means E[Y|X = X 1 ] and E[Y|X = x 2 ] for arbitrary values x 1 and x 2 of X line up along a common axis and this axis needs to coincide with the first principal component axis of Y.

Definition.
Let β 1 denote the eigenvector of ψ yy associated with the largest eigenvalue.The conditional distribution of Y given X follows an allometric extension model for multivariate regression if where x 1 and x 2 are arbitrary values of X and δ ∈ is a constant that may depend on both x 1 and x 2 .
The allometric extension model for multivariate regression states that the fitted values of the regression of Y on X must all line up on the first principal component axis of Y.In other words, the variability in Y explained by X is along the first principal component axis of Y only.
A convenient characterization of the allometric extension model is given by the following result.Suppose ψ yy and ψ xx are both of full rank.Let H = [β 1 , β 2 , . . ., β p ] denote the orthogonal matrix of eigenvectors of ψ yy corresponding to the ordered (largest to smallest) eigenvalues.Partition H as H = [β 1 : H 2 ] where H 2 is the p × (p − 1) matrix of the second through pth eigenvectors.It is easy to show that the allometric extension model for multiple regression is equivalent to H 2 ψ yx = 0.
(2.5) From (2.5) it follows that for some a ∈ q .If the allometric extension model for multivariate regression holds, then the components of X are uncorrelated with all but the first principal component of Y.
Note that the allometric extension model for multivariate regression is a special case of a reduced rank regression with rank equal to one with the additional constraint that the line containing the fitted values from the multivariate regression coincides with the first principal component axis of Y.In particular, from (2.5), it follows that Thus, the covariance matrix for the conditional mean of Y given X has only one non-zero eigenvalue with corresponding eigenvector β 1 .If Y and X have a joint normal distribution, then the covariance matrix of the conditional distribution of Y given X is ψ yy − ψ yx ψ −1 xx ψ xy .If the allometric extension model for multivariate regression holds as well, then the covariance matrix for the conditional distribution of Y given X can be written From (2.6) it follows that β 1 is an eigenvector for the conditional distribution of Y given X which provides a generalization of (2.1) from the two-group model.The allometric extension model for regression is illustrated in Figure 1.

Estimating the Allometric Regression Model
A straightforward approach to estimating the parameters of the allometric extension model is to fit a full-rank regression in the usual way and then project the fitted values onto the first principal component axis of the sample covariance matrix of Y. Let S denote the sample covariance matrix and partition S as and let β1 denote the eigenvector of S yy associated with the largest eigenvalue of S yy .
Consider the regression model where Y has been centered at zero.If a reduced-rank regression of rank one holds, then the coefficient matrix C can be factored as where C 1 is a p × 1 vector and C 2 is a 1 × q vector.C 1 can be estimated using the eigenvector of S yx S −1 xx S xy associated with the largest eigenvalue.C 2 is then estimated using Ĉ1 S yx S −1 xx (e.g., Reinsel and Velu, 1998).In other words, fitting the reduced-rank regression model amounts to projecting the usual leastsquares coefficient matrix S yx S −1 xx onto the space spanned by the first eigenvector of S yx S −1 xx S xy .For the allometric extension model for regression, the coefficient matrix can be estimating by projecting the usual least-squares coefficient matrix onto the space spanned by β1 , the first eigenvector of S yy .Define the projection matrix P as P = β1 β 1 .
Then the estimated coefficient matrix for the allometric extension model is given by PS yx S −1 xx . (3.1) An alternative estimation approach is to use normal theory maximum likelihood.However, there do not appear to be closed form solutions and the likelihood equations are extremely messy (details are not given here).

Testing for Allometric Extension
This section describes a bootstrap test that can be used to determine if an allometric extension model for multivariate regression is consistent with the data or not.The null hypothesis of the test is that the allometric extension model holds for the distribution.In an allometric study where the first principal component corresponds to the direction of allometric growth, the bootstrap test will then allow the investigator to determine if the regressor variables affect growth of the plant or animal along this allometric axis, or if the regressor variables disturb the allometric growth of the plant by altering the coefficients of allometric growth.
Using the notation from Section 2, define the p × p orthogonal matrix H = [β 1 , H 2 ] whose columns are the eigenvectors of Ψ yy .If Z 2 denotes the second through pth principal components of Y, then Z 2 = H 2 Y (assuming Y has been centered at zero).From (2.5), the hypothesis of allometric extension for multivariate regression is equivalent to the components of Z 2 being uncorrelated to the components of X: If H 2 were known, the normal theory likelihood ratio test (e.g., Anderson's book, p.394) is equivalent to and the asymptotic null distribution of n times the natural logarithm of (4.1) would be chi-square with degrees of freedom equal to (p − 1)q.In practice the eigenvectors of Ψ yy must be estimated.Let Ĥ2 denote the matrix of eigenvectors of S yy associated with the second through pth ordered eigenvalues.Multiplying the natural logarithm of (4.1) by n after replacing H 2 with Ĥ2 leads to the following test statistic (after simplifying using standard results for partitioned matrices): The finite sample and asymptotic distributions of (4.2) are unknown due to the fact that H 2 needs to be estimated.
A straightforward alternative to the parametric approach is to use a bootstrap test which does not require the normality assumption.The basic idea is to compute (4.2) using the raw data.Next, transform the raw data so that the null hypothesis is approximately valid (Hall and Wilson, 1991).When testing a hypothesis about the mean, this transformation involves centering the raw data at the hypothesized value of the mean (see e.g., Westfall and Young, 1993).In the current setting, the null hypothesis concerns the covariance structure.In order to transform the data so that the null hypothesis of allometric extension holds, the data needs to be re-scaled (instead of centered) -details are below.Finally, the re-scaled data is resampled and (4.2) is computed for the bootstrap samples.An (approximate) p-value for the bootstrap test is computed as the the proportion of the bootstraped test statistics that exceed the test statistic computed from the raw data.
In order to describe how to re-scale the data so that the null hypothesis of allometric extension holds, let BDB denote the spectral decomposition of ψ, the covariance matrix of (Y , X ) .The matrix D is a diagonal matrix of eigenvalues of ψ and B is the orthogonal matrix of eigenvectors of ψ.Let P = β 1 β 1 denote the projection matrix for the eigenvector β 1 associated with the largest eigenvalue of ψ yy .Then from (3.1), the covariance matrix is a covariance matrix that satisfies the allometric extension model.Let ψ 0 = B 0 D 0 B 0 denote the spectral decomposition of ψ 0 .Then we can transform the original variables (Y , X ) to have a covariance matrix equal to ψ 0 by noting the following: Therefore, left multiplying the (p + q)-dimensional vector (Y , X ) by (B 0 D 1/2 0 ) (D −1/2 B ) yields a distribution satisfying the allometric extension model.
To perform the bootstrap testing, one simply re-scales the raw data using the sample counterparts of the above matrices and then resamples the data.In particular, let (y i , x i ) , i = 1, . . ., n, denote the raw data and let S denote the sample covariance matrix.Let B D B denote the spectral decomposition of S where B is an orthogonal (p + q) × (p + q) matrix and D is a diagonal matrix of ordered eigenvalues of S. Let Denote the spectral decomposition of S 0 by The bootstrap testing procedure is given in the following steps: Bootstrap Testing Procedure 1. Compute λ, the test statistic using (4.2).
2. Re-scale the raw data so that the null hypothesis of allometric extension (approximately) holds by computing: 3. Obtain a bootstrap sample (sampling with replacement) from the re-scaled data (y i0 , x i0 ) and compute the test statistic (4.2), call it λ * .
5. A bootstrap p-value is then computed by the proportion of λ * j that exceed λ.
This bootstrap procedure is quite easy to implement using standard statistical software because it only requires the computation of the eigenvectors and eigenvalues of the sample covariance matrix.Code for the bootstrap test is available using the R-software (Ihaka and Gentelman, 1996) from the authors upon request.
A simulation study was conducted to evaluate the performance of the bootstrap testing procedure.Multivariate normal data was simulated using the Gauss software for a wide variety of parameterizations and sample sizes.The power of the bootstrap test was estimated based on 1000 simulated data sets.For each simulated data set, the bootstrap test described above was performed for N = 1000 bootstrap samples.Ideally, the p-value distribution when simulating under the null hypothesis of an allometric extension model should be uniform (0, 1).The actual p-value distribution under the null hypothesis is approximately uniform with deviations from uniformity in the direction of a conservative test (i.e., the probability of a type I error is smaller than specified using the usual significance levels).Under the null hypothesis, the p-value distribution coincided more closely with a uniform (0, 1) distribution as the sample size increased or as the dimensionality of the data decreased, or as the correlation between the first principal component of the response and the regressors increased.The power of the bootstrap test was evaluated simulating data from distributions where the allometric extension model does not hold.The results of the power study are summarized by the power curves shown in Figure 2. In all cases, the power was computed using a significance level 0.05.The left panel of Figure 2 shows the results of simulating a bivariate response (p = 2) with two regressors (q = 2).The right panel of Figure 2 is based on p = 4 response variables and q = 3 regressors.In each panel, power curves for sample sizes of n = 50 and 100 are shown.The first principal component of the response had a variance of 4 in the left panel and a variance of 6 in the second panel.The other principal component(s) of the response(s) had a variances of 2 and 3 in the left and right panels respectively.The regressors had unit variance in all cases and the covariance of the first principal component with the regressors was 0.5 in all cases.The power curves in Figure 2 were generated by increasing the covariance from 0 (where the null hypothesis of allometric extension holds) to 0.9 in increments of 0.1 between the regressors and the 2nd through pth principal components of the response.As this covariance increases, the model moves further away from the allometric extension model and therefore the power of the test increases which is evident from the curves in Figure 2. The bootstrap test is slightly less powerful in the higher dimensional simulation (right panel of Figure 2) compared to the lower dimensional simulation.These power curves show that the bootstrap test is quite powerful even when the correlation between the regressors and the 2nd-pth principal components of the response are moderate.For example, in the left panel of Figure 2 with n = 50, the power of the test is 0.947 when the 2nd principal component has a correlation of 0.42 with each of the regressors whereas, in the right panel, the power is 0.942 when the correlation between the 2nd-4th principal components of the response have a correlation of 0.4 with the regressors.Further details on the simulation results are available from the authors.
As mentioned in the previous section, the maximum likelihood estimators of the covariance matrix in the allometric extension model for multivariate regression are intractable and the likelihood equations are very messy.Therefore, a likelihood ratio test for the allometric extension model is not pursued.

Application
The ecological balance of nutrient-limited areas such as the Florida Everglades can be jeopardized by anthropogenic nutrient enrichment.In order to detect and monitor the presence of nutrient contamination in the Everglades, a study was conducted to examine variation in plant morphology that results from soil characteristics.In particular, the negative effects of phosphorus enrichment in the Everglades are a concern.This studied examined Sagittaria lancifolia and Cladium jamaicense (or "sawgrass") plants which are common in the Florida Everglades.Survey sites were randomly located throughout the Florida Everglades and a sample of plants and soil measurements were obtained at each site in work that involved the second author.Data were collected during the wet and dry seasons.
Data on n 1 = 287 Sagittaria lancifolia plants were collected during the wet season and n 2 = 298 plants during the dry season.For the sawgrass, there were 517 and 615 observations in the wet and dry seasons respectively.The response variables were the log-transformed length and width of the leaves: For the sawgrass, an addition response variable was measured: Using the previous notation, p = 2 for the Sagittaria lancifolia plants.The q = 2 soil variables are x 1 = Total phosphorus in soil (units are micrograms/gram) x 2 = Percent ash-free dry weight of soil.Ash-free dry weight is related to the amount of organic matter in the soil and is useful for distinguishing mineral soils, called marl, and peat soils that are higher in organic matter.The ash-free dry weight relates to the accessibility of nutrient availability to plants.Variables x 1 and x 2 were standardized since they are measured on different scales.Also, the response variables y were all centered at zero.Phosphorus is a nutrient that has a fertilizing effect on plants leading to higher growth.Similarly, a regression of the leaf dimensions on ash-free weight also shows a positive relationship.We present first the results for the Sagittaria lancifolia plants.Figure 3 shows fitted values of log(Lamina length) and log(Lamina width) from a fullrank regression on x 1 and x 2 for the wet season data.The fitted values tend to cluster along a line indicating that a reduced-rank regression model with rank 1 may hold for the data.The plants were divided into two-groups based on large (above average) and small (below average) phosphorus values.The large phosphorus group is plotted with a triangle and the small phosphorus group is plotted with a circle.The two different plotting symbols indicate that the large group appears to be an allometric extension of the small group (Bartoletti et al., 1999).However, the large and small grouping is artificial.Because phosphorus and the ash-free dry weight are continuous variables, the allometric extension model for regression needs to be tested.The bootstrap test was run on the data for the wet and dry seasons.50,000 bootstrap samples were obtained for each test using the bootstrap testing procedure outlined in Section 4. For the wet season, only one of the 50,000 bootstrap samples yielded a test statistic bigger than the observed test statistic for the raw data giving a p-value of p = 0.00002.Thus, for the wet season, the allometric extension model for regression is clearly rejected.However, for the dry season, the bootstrap p-value is p = 0.25413 indicating that the allometric extension model for regression is consistent with the dry season data.
Figure 4 provides an illustration of the results.Each frame of Figure 4 shows the raw log-leaf data for the wet (left frame) and dry (right frame) seasons.The solid line in each frame is the estimated first principal component axis for the raw plant data.The dashed line is the estimated first principal component axis for the fitted values.Also plotted are the fitted values (indicated with large circles and triangles corresponding to small and large levels of phosphorus).Under the allometric extension model, the dashed and solid principal component axes would exactly coincide.The discrepancy between the solid and dashed line is greater for the wet season than for the dry season which gives some indication why the allometric extension model is less plausible for the wet season than for the dry season as evidenced by the bootstrap test.
The bootstrap test for allometric extension was also run for the sawgrass data which had three dependent variables (logarithms of leaf length, width and rhizome).For the wet season, the bootstrap test yielded an estimated p-value of p = 0.10314; for the dry season, the p-value is p = 0.00000.Therefore, the allometric extension model for multivariate regression is consistent for the wet season data but not for the dry season data, which is the reverse of the results for the Sagittaria lancifolia plants.
The coefficients for the first principal component z 1 for the wet season sawgrass data are 0.516, 0.555, 0.653 which gives estimates of allometric growth rates.The second principal component z 2 can be regarded as a comparison of length and width with rhizome: z 2 = −0.723log(length) − 0.126 log(width) + 0.679 log(rhizome).
Because the allometric extension model is consistent with the wet season sawgrass data, it follows that phosphorus and ash-free dry weight explain variability in the growth of the plants along the z 1 allometric axis only.The percentage of variability in the responses accounted for by the regressors in a multivariate regression can be computed as Table 1 gives the percentage of variability in the log-leaf measurements explained by the full unrestricted regression model, a reduced rank regression model with rank equal to one and for the allometric extension model for both plants and both season.The allometric extension model is the most restrictive of the three models and therefore must have the lowest R 2 value.However, from Table 1, the proportion of variability explained for the allometric extension model is essentially the same as the full unrestricted model and the reduced rank regression model, particularly in the cases where the bootstrap test indicated that the allometric extension model holds (dry season for Sagittaria lancifolia and wet season for sawgrass).Below is the estimated covariance matrix for the dry season Sagittaria lancifolia plant data of (Z , X ) where Z = (Z 1 , Z 2 ) are the two principal components of the log-leaf measurements and X = (X 1 , X 2 ) corresponds to the standardized ash-free weight and phosphorus measurements: The pairwise covariances between x 1 and x 2 with the second principal component of the leaf measurements appear quite small as would be expected for a model consistent with the allometric extension model.It is interesting to also perform a likelihood ratio test for a reduced rank regression (rrr) model of rank r = 1.The likelihood ratio test for testing if the rank of the coefficient matrix is of rank r is λrrr where S(residuals) is the residual sum of squares matrix from the reduced rank model (Reinsel and Velu, 1998, p.50).Asymptotically, −2 log( λrrr ) follows a chi-squared distribution on (p − r)(q − r) degrees of freedom.For the Sagittaria lancifolia plant data we have (p − 1)(q − 1) = 1 degree of freedom.The observed test statistics for the wet and dry seasons are χ 2 = 0.834 and 0.489 respectively indicating that a reduced rank regression with rank equal to one is consistent for both the wet and dry season leaf data.As we have seen however, the more parsimonious model of allometric extension is consistent for the dry season data.Even though we rejected the allometric extension model for the wet season, the less restrictive reduced rank regression of rank equal to one is not rejected.Also, Figure 4 shows that the allometric extension model for the wet season, although rejected by the likelihood ratio test, nonetheless provides a fairly good approximation to the observed data.

Discussion
In the plant examples of Section 5, departures from the allometric extension model may be of interest in identifying the effects of soil nutrient over-enrichment.For instance, if nutrients in the soil and/or water alter the usual growth rates of the plant, then the allometric extension model for multivariate regression will fail to hold.Since the allometric extension model appears consistent with the data from the dry season for the Sagittaria lancifolia plants and since phosphorus has a fertilizing effect, we would expect to see larger leaves following the same growth pattern in areas suffering from phosphorus contamination compared to uncontaminated sites.Strong departures from the allometric extension in this context may be evidence that the phosphorus is causing differences in the both the size and shape of the leaves, possibly a deformation.
The model considered here is quite simple and many generalizations can be pursued.A more restrictive one-dimensional model is that of isometric growth where the fitted values are constrained to lie on the line determined by the unit vector (1, 1, . . ., 1) / √ p.Another interesting variation of the problem is to consider models where the allometric extension model holds for a subset of the response variables in a multivariate regression (Ivey et al., 2004).For higher dimensional data, one could easily modify the bootstrap test of Section 4 to test if the fitted values lie in a lower dimensional subspace spanned by the first few eigenvectors of the covariance matrix of the response distribution.In fact, one can regard these models as special cases of common principal component models (Flury, 1988) where some or all of the eigenvectors of Ψ yx Ψ −1 xx Ψ xy are constrained to coincide with some of the eigenvectors of Ψ yy .Common principal component models for regression then would postulate that the covariance matrix for E[Y|X] could be expressed as (a j Ψ −1 xx a j )β j β j , where β j are eigenvectors of Ψ yy , the a j ∈ q , and r is the rank of the model.It would also be of interest to consider nonlinear generalizations of the reduced rank models.Hastie and Stuetzle (1989) introduced principal curves as a nonlinear generalization of principal component axes.If there is nonlinear structure in the response distribution, then there may very well be nonlinear structure in the fitted values as well after regressing the response on regressors.If the fitted values fall on a curve, then one would have a nonlinear reduced rank regression model of rank equal to one.If this curve coincides with the first principal curve of the response distribution, then a nonlinear allometric extension model for regression holds for the data.

Figure 1 :
Figure 1: An illustration of the allometric extension model for multivariate regression.

Figure 2 :
Figure 2: Power curves for bootstrap test.Simulation results for the bootstrap test of the allometric extension model.

Figure 3 :
Figure 3: Fitted values of logarithms of leaf length and width from a regression on soil ash-free weight and soil total phosphorus of Sagittaria lancifolia plants.Plants were divided into two-groups: above average values of phosphorus (triangles) and below average values of phosphorus (circles).

Figure 4 :
Figure 4: Scatterplots of the log-leaf data for the wet and dry seasons for the Sagittaria lancifolia plant data.The solid lines in the left and right frames are the estimated first principal component axes for the raw data and the dashed lines are the first principal component axes for the fitted values.

Table 1 :
Percentage of leaf variability explained by regression models.