Analyzing Collinear Data by Principal Component Regression Approach — An Example from Developing Countries

The aim of this paper is to identify the effects of socioeconomic factors and family planning program effort on total fertility rate with national level data from forty-three developing countries. The data used have mainly been taken from the secondary source “Family Planning and Child Survival: 100 Developing Countries” compiled by the Center for Population and Family Health, Columbia University. Because the independent variables were found to be highly correlated among themselves, component regression technique has been used to analyze the data. The analysis shows that the family planning program effort has the largest contribution in lowering the total fertility rate, followed by percent of urban population, female literacy rate, and infant mortality rate in that order. Policy implications are discussed.


Introduction
One of the most severe problems facing many developing countries is rapid population growth. Although the World Fertility Survey and the Contraceptive Prevalence Surveys provided evidence of falling fertility levels in many developing countries, women surveyed in these countries were still having large families, with considerable differentials among countries in both fertility and contraceptive use (Population Reports, 1985). The recent fertility decline in some developing countries might lead to the belief that family planning programs were mainly responsible for the fertility reduction, and that the gap between the fertility levels of developing and developed countries could be minimized by the socialization of family planning services. It is true that family planning programs exert very strong direct negative effects on fertility (Poston and Baochang, 1987; Cutright and Kelly, 1981; Mauldin and Berelson, 1978; Tsui and Bogue, 1978; Caldwell et al., 2002). However, given that the developing countries themselves differ considerably in terms of socioeconomic development, it may be that the greatest reductions in fertility occurred in those countries that experienced significant socioeconomic development.
The effects of socioeconomic variables on fertility have been demonstrated in a number of studies. Education depresses fertility by increasing the age at marriage and by increasing the likelihood of contraceptive use (Casterline et al., 1984; Diamond et al., 1997). Other researchers have reported a similar depressant effect of education on fertility (Entwisle and Mason, 1985; Rubin-Kurtzman, 1987; Jiang, 1986; Krishnan, 1988; Prada and Ojeda, 1986; Shapiro and Tambashe, 1994; Kravdal, 2002). Place of residence has also been found to be significantly related to fertility: total fertility rates are higher among rural women than among urban women (Alam and Casterline, 1984; Rubin-Kurtzman, 1987; Prada and Ojeda, 1986). Income is negatively related to fertility (Rubin-Kurtzman, 1987; Jiang, 1986).
One important reason for analyzing fertility in developing countries is that there is considerable variability in fertility, socioeconomic development, and family planning behaviour among these countries themselves. Moreover, the relevant data are available for many developing countries. The pivotal question that guided this research is whether, and if so to what extent, socioeconomic and other developmental factors induce changes in national fertility levels, and how these effects compare with that of the family planning program effort. The rationale is that the socioeconomic and other developmental factors exert independent as well as joint influences on the fertility rate, after eliminating the effect of the family planning program effort. The aim of this paper is to identify these factors and their relative contributions to the variation in fertility level across a number of developing countries for which the relevant data are available. The importance of the study derives from the fact that it is necessary to identify those population groups whose fertility is high but reducible through changes in government policy and a redistribution of available resources.

Data and Variables
This paper analyzes the fertility data of 43 developing countries from Asia, Africa, and Latin America. The data have been obtained from Family Planning and Child Survival: 100 Developing Countries (Ross et al., 1988), compiled by the Center for Population and Family Health, Columbia University, New York, as well as from the 1987 World Population Data Sheet (Population Reference Bureau, 1987). The data are shown in the appendix.
The dependent variable is the total fertility rate (TFR: $Y$), defined as the number of live births a hypothetical woman would have if she survived to the end of her reproductive period and experienced a given set of age-specific fertility rates. Variables that appeared influential in earlier studies in accounting for fertility variation have been considered as explanatory variables. These variables are: percent of total population living in urban areas (URBAN: $X_1$), percent of population with access to safe water supply (SWATER: $X_2$), population per square kilometer (DENSITY: $X_3$), per capita daily calories (CALORIE: $X_4$), percent of female population 15 years old and over who can read and write (FLITERACY: $X_5$), family planning program effort score based on four components, namely policy and stage setting, service, record keeping and evaluation, and availability and accessibility (FPSCORE: $X_6$), infant mortality rate, i.e., the number of infant deaths per thousand live births (IMR: $X_7$), per capita energy use (ENERGY: $X_8$), and per capita gross national product (GNP: $X_9$). Ross et al. (1988) and Population Reference Bureau (1987) discuss these variables in more detail. The analysis was based on 43 developing countries for which data were available for all 10 variables.
Table 1 presents the means and standard deviations of the dependent as well as the explanatory variables. The TFR has an average value of 5.5 children per woman, varying from lows of 2.3 children in Mauritius and 2.4 children in Chile to highs of 8.5 children in Rwanda and 8.0 children in Kenya. We expect negative relationships between TFR and the variables URBAN, SWATER, CALORIE, FLITERACY, FPSCORE, ENERGY, and GNP, while we hypothesize positive relationships between TFR and the variables IMR and DENSITY.
The results of fitting the ordinary least squares (OLS) regression model
$$Y = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \cdots + \gamma_9 X_9 + \varepsilon$$
(where $\gamma_0$ is the intercept, the $\gamma_i$'s are the regression coefficients, and $\varepsilon$ is the error term) connecting the total fertility rate $Y$ and the nine explanatory variables $X_1, X_2, \ldots, X_9$ are shown in table 2.
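As a rough illustration of how unstandardized and standardized OLS coefficients of the kind reported in table 2 relate to each other, the following sketch fits the model on synthetic data. The function name `ols_fit`, the random data, and the chosen true coefficients are assumptions made only for illustration; they are not the paper's data or results.

```python
import numpy as np

def ols_fit(X, y):
    """Return intercept, unstandardized slopes, standardized slopes, and R^2."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])          # intercept column first
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares solution
    gamma0, gamma = coef[0], coef[1:]
    # Standardized coefficient: slope rescaled by sd(X_i) / sd(Y).
    beta = gamma * X.std(axis=0, ddof=1) / y.std(ddof=1)
    resid = y - design @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return gamma0, gamma, beta, r2

# Synthetic stand-in data (43 rows, 9 columns); NOT the data used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(43, 9))
X[:, 8] = X[:, 7] + 0.1 * rng.normal(size=43)   # two nearly collinear columns
y = 5.5 - 0.5 * X[:, 5] - 0.3 * X[:, 0] + rng.normal(scale=0.8, size=43)

gamma0, gamma, beta, r2 = ols_fit(X, y)
print("R^2:", round(r2, 3))
print("standardized coefficients:", np.round(beta, 3))
```

The standardized coefficients obtained this way coincide with the slopes from regressing the standardized response on the standardized explanatory variables, which is the form used later for the principal component analysis.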
The F value is significant at probability level 0.0001, implying that the variables chosen are valid explanatory variables (Chatterjee and Price, 1977, p. 146). The table shows that the value of $R^2$ is quite large (0.72). This does not, however, imply a good fit (Anscombe, 1973), nor that the model assumptions have not been violated (Chatterjee and Price, 1977). Plots of the standardized residuals against the fitted values, as well as against the explanatory variables, did not show any systematic pattern of variation, and all the standardized residuals fell between +2 and −2. Nor did the plots reveal any outliers. Consequently, there is no evidence of model misspecification, nor of any serious violation of the model assumptions.
Having specified the model properly, we need to see whether multicollinearity could be a problem. Among the nine explanatory variables, two variables, ENERGY and GNP, are strongly correlated (r = 0.95). Each explanatory variable was then regressed on all other explanatory variables. Two of the eight possible $R^2$'s from such regressions, those from the regressions of ENERGY and GNP, were quite large (0.93 and 0.94 respectively). The eigenvalues of the correlation matrix of the explanatory variables have also been calculated. The smallest of these eigenvalues is 0.031, which is quite small and can be taken to be close to zero. The sum of the reciprocals of these eigenvalues is 48.06, which is greater than five times the number of explanatory variables used (5 × 9 = 45). All these indicate the presence of multicollinearity in the data, and as such, the estimates of the model parameters obtained by the OLS regression method are unstable and unreliable. To avoid this problem, we have used an alternative method of estimation, principal component regression (PCR), which is recommended when multicollinearity is present in the data (Chatterjee and Price, 1977, p. 175). The PCR method produces estimates which, although biased, can have smaller mean square error than the estimates provided by the OLS method.
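The three diagnostics described above (auxiliary $R^2$ values, the smallest eigenvalue, and the sum of reciprocal eigenvalues compared with five times the number of variables) can be sketched as follows. The function name `collinearity_diagnostics` and the synthetic data are assumptions for illustration only. The sketch uses the standard identity that the diagonal of the inverse correlation matrix gives $1/(1 - R_i^2)$, so the sum of reciprocal eigenvalues equals the sum of these quantities.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Eigenvalue and auxiliary-R^2 diagnostics for the columns of X."""
    R = np.corrcoef(X, rowvar=False)            # p x p correlation matrix
    eigvals = np.linalg.eigvalsh(R)[::-1]       # eigenvalues, descending
    # Diagonal of R^{-1} equals 1 / (1 - R_i^2), where R_i^2 comes from
    # regressing variable i on all the other explanatory variables.
    inflation = np.diag(np.linalg.inv(R))
    aux_r2 = 1.0 - 1.0 / inflation
    return eigvals, float(np.sum(1.0 / eigvals)), aux_r2

# Synthetic stand-in data; NOT the data used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(43, 9))
X[:, 8] = X[:, 7] + 0.1 * rng.normal(size=43)   # two nearly collinear columns

eigvals, recip_sum, aux_r2 = collinearity_diagnostics(X)
p = X.shape[1]
collinear = recip_sum > 5 * p                   # rule of thumb quoted in the text

# Partial check using only the six eigenvalues reported in the text
# (lambda_4 to lambda_9); the three largest are not available here.
reported = [0.68244, 0.55685, 0.37898, 0.32980, 0.20487, 0.03086]
partial_sum = sum(1.0 / lam for lam in reported)
print("partial sum of reciprocals:", round(partial_sum, 2))
```

The reciprocals of the six reported eigenvalues alone already exceed 45, so the rule of thumb is met whatever the three largest eigenvalues are.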
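The nine eigenvalues of the matrix of bivariate correlations between pairs of the explanatory variables, in descending order, are: [...], $\lambda_4 = 0.68244$, $\lambda_5 = 0.55685$, $\lambda_6 = 0.37898$, $\lambda_7 = 0.32980$, $\lambda_8 = 0.20487$, and $\lambda_9 = 0.03086$, and the corresponding eigenvectors (written as row vectors) are: [...]. Writing $v_j = (v_{j1}, v_{j2}, \ldots, v_{j9})$ for the $j$-th eigenvector and $x_1, x_2, \ldots, x_9$ for the standardized explanatory variables, the principal components are
$$Z_j = v_{j1} x_1 + v_{j2} x_2 + \cdots + v_{j9} x_9, \qquad j = 1, 2, \ldots, 9,$$
which are linear functions of the standardized explanatory variables with the covariance matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_9)$.
Our model is
$$Y = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \cdots + \gamma_9 X_9 + \varepsilon. \qquad (2.1)$$
Equation (2.1) can be written in terms of standardized variables as
$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_9 x_9 + \varepsilon', \qquad (2.2)$$
which is equivalent to
$$y = \alpha_1 Z_1 + \alpha_2 Z_2 + \cdots + \alpha_9 Z_9 + \varepsilon', \qquad (2.3)$$
where the $\alpha$'s and $\beta$'s are related as
$$\alpha_j = \sum_{i=1}^{9} v_{ji}\, \beta_i, \qquad j = 1, 2, \ldots, 9, \qquad (2.4)$$
or conversely
$$\beta_i = \sum_{j=1}^{9} v_{ji}\, \alpha_j, \qquad i = 1, 2, \ldots, 9. \qquad (2.5)$$
The variance of $Z_9$ is $\lambda_9 = 0.03086$, which is small and, as mentioned before, can be taken to be approximately zero. This implies that the variable $Z_9$ is approximately constant, and hence is approximately equal to its mean. Since $Z_9$ is a linear function of the standardized variables $x_i$, $Z_9$ has mean zero. It follows that the variable $Z_9$ is itself approximately zero and is the source of multicollinearity. Let us exclude $Z_9$ and regress $y$ on $Z_1, Z_2, \ldots, Z_8$. The possible regressions to be considered are
$$y = \alpha_1 Z_1 + \alpha_2 Z_2 + \cdots + \alpha_k Z_k + \varepsilon', \qquad k = 1, 2, \ldots, 8. \qquad (2.6)$$
Each of these models will lead to estimates of all nine of the original coefficients $\beta_i$, $i = 1, 2, \ldots, 9$. These estimates will be biased since $Z_9$ has been excluded in all cases. Including $Z_9$ would produce exactly the same estimates as were obtained from the OLS regression of $Y$ on all nine explanatory variables given in table 2.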
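The construction of the principal components from the correlation matrix can be sketched as below. This is a minimal illustration on synthetic data, not the paper's computation; the function name `principal_components` and the matrix name `V` (eigenvectors stored as rows) are assumptions. The check at the end confirms the property used in the text: the components are uncorrelated and the variance of each equals its eigenvalue.

```python
import numpy as np

def principal_components(X):
    """Return component scores Z, eigenvalues (descending), and eigenvectors as rows of V."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
    R = np.corrcoef(X, rowvar=False)                    # correlation matrix
    eigvals, vecs = np.linalg.eigh(R)                   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                   # reorder to descending
    eigvals = eigvals[order]
    V = vecs[:, order].T                                # row j is the j-th eigenvector
    Z = Xs @ V.T                                        # Z_j = sum_i v_ji * x_i
    return Z, eigvals, V

# Synthetic stand-in data; NOT the data used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(43, 9))
X[:, 8] = X[:, 7] + 0.1 * rng.normal(size=43)   # two nearly collinear columns

Z, eigvals, V = principal_components(X)
cov_Z = np.cov(Z, rowvar=False)                 # should be diag(eigvals)
print("smallest eigenvalue:", round(eigvals[-1], 5))
print("variance of last component:", round(cov_Z[-1, -1], 5))
```

A component whose eigenvalue is close to zero is nearly constant across observations, which is why it carries the near-linear dependence among the explanatory variables.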
It is to be noted that the regression coefficients in (2.6) can be obtained in a simpler way by exploiting the orthogonality property of $Z_1, Z_2, \ldots, Z_9$ without actually performing the regressions. Because of this orthogonality property, the estimate of $\alpha_1$ is the same for all $k = 1, 2, \ldots, 8$. Similarly, the estimate of $\alpha_2$ is the same for all $k = 2, 3, \ldots, 8$. The same is true for the other $\alpha$'s. Then, using the standardized estimates, denoted as $\beta^{(9)}$'s, based on all nine principal components (column 18 of table 3, which is the same as the last column of table 2), we can obtain the estimates of the $\alpha$'s by using equations (2.4). In order to obtain the principal component regression estimates of the $\beta$'s corresponding to the equations in (2.6), we can refer back to equations (2.5) and set the appropriate $\alpha$'s to zero. Using the standardized estimates $\beta^{(9)}$'s from the last column of table 3 in equations (2.4), we have the corresponding estimates of the $\alpha$'s as $\alpha_1 = -0.3160$, $\alpha_2 = -0.3229$, $\alpha_3 = 0.2299$, $\alpha_4 = -0.0724$, $\alpha_5 = 0.1336$, $\alpha_6 = -0.2200$, $\alpha_7 = -0.4091$, $\alpha_8 = -0.0160$, and $\alpha_9 = -0.0775$.
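The shortcut described above, obtaining the $\alpha$'s from the full standardized OLS estimates and then setting the trailing $\alpha$'s to zero, can be sketched as follows. The function name `pcr_standardized`, the eigenvector matrix `V` (eigenvectors as rows), and the synthetic data are assumptions for illustration; the numbers it produces are not the paper's results.

```python
import numpy as np

def pcr_standardized(X, y, k):
    """Standardized PCR coefficients based on the first k principal components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    eigvals, vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    V = vecs[:, np.argsort(eigvals)[::-1]].T             # eigenvectors as rows, descending
    beta_full = np.linalg.lstsq(Xs, ys, rcond=None)[0]   # OLS on standardized variables
    alpha = V @ beta_full                                # alphas from betas
    alpha_k = np.where(np.arange(alpha.size) < k, alpha, 0.0)  # drop components after k
    beta_k = V.T @ alpha_k                               # back to the betas
    return beta_k, alpha, beta_full

# Synthetic stand-in data; NOT the data used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(43, 9))
X[:, 8] = X[:, 7] + 0.1 * rng.normal(size=43)   # two nearly collinear columns
y = 5.5 - 0.5 * X[:, 5] - 0.3 * X[:, 0] + rng.normal(scale=0.8, size=43)

beta_8, alpha, beta_ols = pcr_standardized(X, y, 8)
beta_9, _, _ = pcr_standardized(X, y, 9)        # all components: reproduces OLS
print("shift from dropping the last component:", np.round(beta_ols - beta_8, 3))
```

Because the eigenvectors are orthonormal, the overall size of the change in the standardized coefficients caused by dropping the last component equals the absolute value of its $\alpha$; a small $\alpha$ therefore means the remaining coefficients barely move.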
Table 3 shows that the differences in results obtained by using different numbers of principal components are quite substantial. As was mentioned before, the estimates in the last PCR equation (columns 17 and 18), involving all nine possible principal components, are the same as the OLS estimates, and as such will not be considered. In other words, we need to choose one from among the other eight PCR equations. The criteria used here for choosing the best PCR equation are the stability of the coefficients, the amount of information used, and the percentage of variation explained. Let us consider equations (2.6) with k = 7 (columns 13 and 14) and k = 8 (columns 15 and 16). It is clear that the first seven principal components are associated with the combined effect of $X_1$ (URBAN), $X_5$ (FLITERACY), and $X_6$ (FPSCORE). These coefficients remained almost the same after adding the eighth principal component. The coefficients of the other variables also remained almost the same.
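The "percentage of variation explained" criterion can be examined using the figures reported in the text. The sketch below assumes that the response and the explanatory variables are standardized to unit variance, so that each principal component $Z_j$ contributes $\alpha_j^2 \lambda_j$ to $R^2$; this decomposition is an assumption added here for illustration, not a calculation shown in the paper. Only components 4 to 9 are included, because the three largest eigenvalues are not available in this text.

```python
# Alpha estimates and eigenvalues as reported in the text, components 4 to 9 only.
alpha = {4: -0.0724, 5: 0.1336, 6: -0.2200, 7: -0.4091, 8: -0.0160, 9: -0.0775}
lam = {4: 0.68244, 5: 0.55685, 6: 0.37898, 7: 0.32980, 8: 0.20487, 9: 0.03086}

# Assumed decomposition: contribution of component j to R^2 is alpha_j^2 * lambda_j
# (valid when y and the x's have unit variance and the components are uncorrelated).
contribution = {j: alpha[j] ** 2 * lam[j] for j in alpha}
total_4_to_9 = sum(contribution.values())

for j in sorted(contribution):
    print(f"component {j}: {contribution[j]:.5f}")
print(f"components 4 to 9 together: {total_4_to_9:.4f}")
```

Under these assumptions the eighth and ninth components each add well under 0.001 to $R^2$, which is consistent with the equations for k = 7 and k = 8 explaining almost the same share of variation as the OLS fit.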
The table shows that no other pair of equations reveals as much overall stability as the equations with k = 7 and k = 8 in (2.6). Each of these equations also explains 72 percent of the variation in the total fertility rate, almost the same amount as the OLS method explained. Thus, either of these two equations could be chosen. However, the equation with k = 8 is preferred since it uses eight principal components and, therefore, uses almost all of the information contained in the data. Hence, the use of the equation with k = 8 is expected to reduce the amount of bias that will inevitably creep into the estimates. Thus, we select the model based on the first eight principal components. In terms of the original variables, therefore, the fitted equation has the unstandardized coefficients given in table 3 (column 15).
In order to evaluate the relative importance of the explanatory variables in determining the total fertility rate, the standardized coefficients are examined (table 3: column 16). The table shows that the impact of the family planning program effort, as measured in standard deviation units, is the largest in lowering the total fertility rate, followed by the percent of urban population, female literacy rate, and infant mortality rate. Indeed, the impact of the family planning program effort is more than one and a half times greater than that of percent of urban population, more than two and a half times greater than that of female literacy rate, and more than five times greater than that of infant mortality rate. The unstandardized coefficients show that a unit increase in the family planning program effort score is associated with a decrease of 0.033 children per woman, a one percent increase in the urban population is associated with a decrease of 0.031 children per woman, a one percent increase in the female literacy rate is associated with a decrease of 0.012 children per woman, and an increase of infant mortality by one per 1000 live births is associated with an increase of 0.004 children per woman.
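To make the unstandardized coefficients concrete, the following sketch computes the change in the total fertility rate implied by hypothetical changes in the four variables quoted above. The helper name `tfr_change` and the example scenario are illustrative assumptions; the calculation is a linear extrapolation of cross-national associations and should not be read as a causal forecast.

```python
# Unstandardized coefficients quoted in the text (children per woman per unit change).
coef = {
    "FPSCORE": -0.033,    # per point of family planning program effort score
    "URBAN": -0.031,      # per percentage point of urban population
    "FLITERACY": -0.012,  # per percentage point of female literacy
    "IMR": 0.004,         # per infant death per 1000 live births
}

def tfr_change(deltas):
    """Change in TFR implied by the given changes in the explanatory variables."""
    return sum(coef[name] * change for name, change in deltas.items())

# Hypothetical scenario: effort score up 10 points, female literacy up 10 points,
# infant mortality down by 20 per 1000 live births.
scenario = {"FPSCORE": 10, "FLITERACY": 10, "IMR": -20}
print(f"implied change in TFR: {tfr_change(scenario):.2f} children per woman")
```

In this hypothetical scenario the effort score accounts for 0.33 of the implied decline, female literacy for 0.12, and infant mortality for 0.08.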
It is to be noted that six of the nine explanatory variables have the hypothesized directions of relationships with the total fertility rate. It is difficult to interpret the unexpected directions of the relationships for the variables SWATER, CALORIE, and DENSITY. Whether these directions, counter to our expectation, will persist after the inclusion of other developing countries in the analysis remains to be seen.

Summary and Conclusions
Although recent decades have witnessed declines in fertility rates in developing countries, these rates are still at much higher levels than in developed countries. Given the many problems associated with rapid population growth, it is important to analyze the determinants of fertility in developing countries and to identify their relative weights, which are needed for setting priorities when formulating population policies.
The cross-national variation in total fertility rate has been analyzed in this paper using the principal component regression technique with national data for 43 developing countries. The explanatory variables used are: percent of population living in urban areas, percent of population with access to safe water supply, population density, per capita daily calories, female literacy rate, family planning program effort score, infant mortality rate, per capita energy use, and per capita gross national product.
The analysis shows that the family planning program effort has the highest impact on the total fertility rate, followed by percent of urban population, female literacy rate, and infant mortality rate, in that order. Indeed, the effect of the family planning program effort is more than one and a half times greater than that of the percent of urban population, more than two and a half times greater than that of the female literacy rate, and more than five times greater than that of the infant mortality rate.
This study has a number of policy implications. The family planning program effort is the most important contributor to the reduction of the total fertility rate. This lends support to the contention that the main factor behind the recent decline in fertility in the developing countries has been the governments' family planning programs.
The second most important variable is the percent of urban population: the higher this percentage, the lower the total fertility rate. In developing countries, larger segments of the population live in rural areas. Urban areas are usually the centers of political and economic power, and a great part of resources and social services is concentrated in them. As such, people living in urban areas enjoy relatively more of the opportunities of modern life, which are conducive to smaller family size.
The third most important variable is the female literacy rate: the higher this rate, the lower the total fertility rate. Better educated women enjoy better access to life opportunities, and hence lower fertility appears more advantageous to them than higher fertility, since with lower fertility it is easier to reap the benefits of those opportunities. Among women with no education, even a significant difference in the number of children fails to make any observable difference in the level of living, and as a result lower fertility does not appear to them as a favourable life condition. As such, societies with lower levels of literacy are more likely to have higher fertility rates.
The last important variable is the infant mortality rate: the higher the infant mortality rate, the higher the fertility rate. Many studies have obtained results supportive of the positive effect of infant and child mortality on fertility (Adlakha, 1973; Taylor, Newman, and Kelly, 1976). The idea is conceptually related to the child survival hypothesis. Experience with, or fear of, infant and child mortality might lead married couples to have 'extra' births to replace young children who have died. Another possibility is that couples might adopt modern contraception only when they are confident that their fertility goal will be reached and not eroded by child mortality. As such, societies with higher infant mortality tend to have higher fertility.
Thus, although family planning programs have played the most important role in recent declines of fertility levels in developing countries, these declines should not be viewed as due solely to successful family planning programs. The results of this analysis indicate that an egalitarian distribution of the benefits of socioeconomic development over rural and urban areas, an increase in the level of female literacy, and a decrease in the level of infant mortality may be important strategies for reducing the fertility rates in developing countries.
Data used in the analysis can be obtained from the web version of JDS.

Table 1: Means and standard deviations of total fertility rate and nine explanatory variables: 43 developing countries from Asia, Africa, and Latin America.

Table 2: Unstandardized and standardized coefficients of regression of total fertility rate on the nine explanatory variables.

Table 3: Principal component regression results for total fertility rate: Data for 43 developing countries.