Regression Analysis of Collinear Data using r-k Class Estimator: Socio-Economic and Demographic Factors Affecting the Total Fertility Rate (TFR) in India

A basic assumption of the general linear regression model is that the explanatory variables are not correlated with one another (no multicollinearity). When this assumption is not satisfied, the least squares estimators have large variances, become unstable, and may have the wrong sign. We therefore resort to biased regression methods, which stabilize the parameter estimates. Ridge regression (RR) and principal component regression (PCR) are two of the most popular biased regression methods used in the presence of multicollinearity. The r-k class estimator, which combines the RR estimator and the PCR estimator into a single estimator, gives better estimates of the regression coefficients than either the RR estimator or the PCR estimator. This paper applies multiple regression with the r-k class estimator to the relationship between TFR and other socio-economic and demographic variables, using data from the National Family Health Survey-III (NFHS-III) for 29 states of India. The analysis shows that use of contraceptive devices has the greatest impact on the fertility rate, followed by maternal care, use of improved water, female age at marriage and spacing between births.


Introduction
In developing countries, overpopulation is considered to be one of the most basic causes of underdevelopment. Developing countries already face a lack of resources, and with a rapidly increasing population the resources available per person are reduced further, leading to increased poverty, malnutrition, and other problems related to large populations. Given this situation, the governments of developing countries, along with non-government organizations, are trying to address the problem by conducting research on the determinants of fertility. India is also dealing with this acute problem, which tends to nullify most of the efforts to encourage development. The government of India has organized several programs to control population growth and has invested a lot of money in controlling the birth rate. Some of the programs have been successful and the rate of increase has fallen, but it has yet to reach a sustainable level. A question of concern to demographers and other social scientists is whether this decline in fertility has been fostered mainly by the family planning programs. Indeed, this reduction in fertility has in some cases led to the belief that the gap between the fertility levels of the different states of India can be substantially reduced by the socialization of family planning services. Available evidence, however, shows that India has considerable differentials in fertility as well as in contraceptive use among the various states. These differentials can well be attributed to the fact that socio-economic factors are often differentially distributed across the social groups that exist in a society or between societies. Moreover, given that the states of India differ considerably in terms of socio-economic development, it may be that the greatest reduction in fertility occurred in those states that experienced significant socio-economic development.
The effects of socio-economic factors on fertility have been examined in a number of studies. Education depresses fertility by increasing the age at marriage and by increasing the likelihood of contraceptive use (Casterline et al., 1984; Diamond et al., 1997). Other researchers have reported a similar depressant effect of education on fertility (Entwisle and Mason, 1985; Rubin-Kurtzman, 1987; Jiang, 1986; Krishnan, 1988; Prada and Ojeda, 1986; Shapiro and Tambashe, 1994; Kravdal, 2002). Place of residence has also been found to be significantly related to fertility: total fertility rates are higher among rural women than among urban women (Alam and Casterline, 1984; Rubin-Kurtzman, 1987; Prada and Ojeda, 1986). Income is negatively related to fertility (Rubin-Kurtzman, 1987; Jiang, 1986).
The total fertility rate (TFR) is the most important measure of fertility in demography, and it is affected by many socio-economic and other development factors. The main objective of this paper is therefore to determine to what extent, and how, socio-economic and other development factors affect the fertility level of India. It is believed that these factors exert significant independent as well as joint impacts on fertility after eliminating the effect of the family planning programs and policies. An attempt has been made in this paper to identify these factors and their relative contributions to the variation in the fertility level of India. The importance of the study derives from the fact that it is necessary to identify those population groups whose fertility is high but reducible through changes in government policies and the redistribution of available resources.

Data and Method
The dependent variable is the total fertility rate (defined as the average number of children a woman has in her lifetime). The total fertility rate is affected by many demographic, social, cultural, and economic variables. The explanatory variables considered in the present study are those that appeared influential in fertility variation. These variables are the Human Development Index (HDI = $X_1$), the infant mortality rate (IMR = $X_2$), defined as infant deaths per thousand live births, the percent of the population using contraceptive devices (any method = $X_3$), the median age at marriage for males (= $X_4$), the median age at marriage for females (= $X_5$), the median number of months since the preceding birth (= $X_6$), the percent of the population using improved water for drinking (= $X_7$), the male literacy rate (= $X_8$), the female literacy rate (= $X_9$) and the percent of mothers receiving maternal care (= $X_{10}$). The independent variables considered are both discrete and continuous in nature; for example, $X_4$, $X_5$ and $X_6$ are continuous variables while the others are discrete. We have confined ourselves to integral values of the age at marriage and birth intervals, but these can be treated as the continuous case (Sufian, 2005). The data on these variables were taken from the National Family Health Survey-III (NFHS-III) for 29 Indian states. The National Family Health Survey is a nationwide sample survey that uses the following sampling design and techniques of data collection.
Sample Design: The urban and rural samples within each state were drawn separately and, to the extent possible, the sample within each state was allocated proportionally to the size of the state's urban and rural populations. A uniform sample design was adopted in all the states. In each state, the rural sample was selected in two stages: the selection of primary sampling units (PSUs), which are villages, with probability proportional to population size (PPS) at the first stage, followed by the random selection of households within each PSU at the second stage. In urban areas, a three-stage procedure was followed. In the first stage, wards were selected with PPS sampling. In the next stage, one census enumeration block (CEB) was randomly selected from each sample ward. In the final stage, households were randomly selected within each sample CEB. Each ward comprises several CEBs created for the census. A list of all the CEBs in a selected ward formed the sampling frame at the second stage. Such lists of CEBs in the selected wards were made available for use in NFHS-III by the census office on request. Each CEB comprises about 150-200 households.
Sample Selection: In rural areas, the 2001 Census list of villages served as the sampling frame. The list was stratified by a number of variables. The first level of stratification was geographic, with districts being subdivided into contiguous regions. Within each of these regions, villages were further stratified using selected variables from the following list: village size, percentage of males working in the non-agricultural sector, percentage of the population belonging to scheduled castes or scheduled tribes, and female literacy. In addition to these variables, HIV prevalence status, i.e., "High", "Medium" and "Low" as estimated for all the districts in high HIV prevalence states, was used for stratification in the high HIV prevalence states. Female literacy was used for implicit stratification (i.e., the villages were ordered prior to selection according to the proportion of females who were literate) in most states, although it may have been an explicit stratification variable in a few states.
Table 1 presents the means and standard deviations of the dependent as well as the explanatory variables. The state-wise TFR is taken as data, and the mean and standard deviation are then computed. The TFR has an average value of 2.62 children per woman, varying from lows of 1.79 children in Goa and Andhra Pradesh, 1.8 children in Tamil Nadu and 1.94 children in Himachal Pradesh, among other states, to highs of 4.2 children in Bihar, 3.8 children in Meghalaya and 3.6 children in Jharkhand. We hypothesize that contraceptive use, female and male age at marriage, birth interval, use of improved water, HDI, female and male literacy rates and maternal care will be negatively related to the total fertility rate, while a positive relationship is expected between TFR and the IMR.
The most commonly used estimator for the estimation of parameters is the ordinary least squares (OLS) estimator. Under certain assumptions, the least squares method produces estimators with desirable properties. In some instances (e.g., when one or more assumptions do not hold) other estimators may be superior to OLS. These other estimators include the maximum likelihood, ridge, principal components and r-k class estimators.
The $n$ observations on the dependent variable $Y$ are determined by
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{10} x_{i,10} + \varepsilon_i, \quad i = 1, 2, \ldots, n,$ (2.1)
which can also be written as
$y_i = \beta_0 + \sum_{j=1}^{10} \beta_j x_{ij} + \varepsilon_i,$ (2.2)
where $Y$ is the response variable, i.e., TFR, $X_1, X_2, \ldots, X_{10}$ are the predictor variables and $\beta_0$ is the intercept term, which gives the mean or average effect on $Y$ of all the variables excluded from the model. The $\beta_j$'s are partial regression coefficients, or slope parameters, describing the relation between the response and the predictor variables: each $\beta_j$ measures the change in the mean value of $Y$ per unit change in $X_j$ when all other predictor variables are held constant. Consider the standard matrix form of the above multiple linear regression model,
$Y = X\beta + \varepsilon,$ (2.3)
where $X = (x_{ij})$ is a fixed $n \times (p + 1)$ matrix of full column rank ($p \le n$), $x_{ij}$ being the $i$th observation on the $j$th independent variable, $Y = (y_i)$ is an $n \times 1$ vector of observations on the dependent variable, $\beta$ is a $(p + 1) \times 1$ unknown column vector of regression coefficients, and $\varepsilon = (\varepsilon_i)$ is an $n \times 1$ vector of random errors. Let us assume that the variables have been standardized by subtracting their sample means and dividing by their sample standard deviations. Then the model given in (2.3) becomes
$Y = X\beta + \varepsilon,$ (2.4)
where $X$ is now the $n \times p$ matrix of standardized predictors and the intercept drops out. We wish to estimate the $p \times 1$ vector $\beta$ of regression coefficients. The variables are assumed to be standardized so that $X'X$ is in the form of a correlation matrix, and the vector $X'Y$ is the vector of correlation coefficients of the dependent variable with each explanatory variable.
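The standardization step can be illustrated with a small sketch. It uses synthetic data rather than the NFHS-III values, and the helper name `unit_length_scale` is hypothetical. One detail worth noting: dividing centered columns by the sample standard deviation gives a cross-product matrix equal to $(n-1)$ times the correlation matrix, so the sketch scales each centered column to unit length instead, which makes $X'X$ equal the correlation matrix exactly.

```python
import numpy as np

def unit_length_scale(X):
    """Center each column and scale it to unit length, so that Z'Z is the correlation matrix."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))

# Synthetic illustration only (29 rows to mirror the 29 states; values are random).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(29, 3))
X_raw[:, 2] = X_raw[:, 0] + 0.3 * rng.normal(size=29)  # make two columns correlated

Z = unit_length_scale(X_raw)
R = Z.T @ Z  # cross-product matrix of the scaled predictors
```

Here `R` agrees with `np.corrcoef(X_raw, rowvar=False)`, which is the sense in which $X'X$ is "in the form of a correlation matrix".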
The least squares (LS) estimator $\hat{\beta}$ of the parameters is given by
$\hat{\beta} = (X'X)^{-1} X'Y.$ (2.5)
The assumptions for ordinary least squares (OLS) are:
1. $X$ is a set of fixed numbers.
2. $X$ is a full column rank matrix, i.e., the rank of $X$ should be $p$.
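As a minimal sketch of (2.5), the normal equations can be solved directly when $X$ has full column rank. The data below are synthetic and the function name `ols` is only illustrative; solving the linear system is used in place of forming the explicit inverse, which is a numerical convenience rather than a change to the estimator.

```python
import numpy as np

def ols(X, y):
    """Least squares estimate from the normal equations (X'X) b = X'y.

    Assumes X has full column rank, as in assumption 2 above.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic example: 29 observations, 4 predictors, known coefficients plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(29, 4))
beta_true = np.array([0.5, -1.0, 0.0, 2.0])
y = X @ beta_true + 0.1 * rng.normal(size=29)

beta_hat = ols(X, y)
```

When $X'X$ is close to singular, as under multicollinearity, this system becomes ill-conditioned, which is the numerical face of the instability discussed in the next section.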

Predictor variables
OLS has long been treated as the best estimator. However, many results have shown that the OLS estimator is no longer a good estimator when multicollinearity is present (Al-Hassan, 2008). In multiple linear regression models, we usually assume that the explanatory variables are independent of one another. In practice, however, there may be strong or nearly strong linear relationships among the explanatory variables. In that case the independence assumption is no longer valid, which causes the problem of multicollinearity. Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly linearly related or correlated. If our goal is simply to predict $Y$ from a set of $X$ variables, then multicollinearity is not a problem: the predictions will still be accurate, and the overall $R^2$ quantifies how well the model predicts the $Y$ values. If our goal is to understand how the various $X$ variables affect $Y$, then multicollinearity is a big problem. In the presence of multicollinearity, it is impossible to estimate the unique effects of individual variables in the regression equation. Multicollinearity increases the standard errors of the coefficients. Increased standard errors mean that coefficients for some independent variables may be found insignificant, whereas without multicollinearity and with lower standard errors these same coefficients might have been found significant. Moreover, the LS estimates are likely to be too large in absolute value and possibly of the wrong sign (Al-Hassan, 2008). Multicollinearity therefore becomes one of the serious problems in linear regression analysis. It affects only calculations regarding individual predictors; i.e., a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
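The inflation of standard errors can be made concrete with the variance inflation factor (VIF). For standardized predictors, the variance of the $j$th OLS coefficient is $\sigma^2$ times the $j$th diagonal element of the inverse correlation matrix, and that diagonal element is the VIF. The sketch below uses a hypothetical two-predictor correlation matrix with an illustrative correlation of 0.89; the function names are assumptions made for this example.

```python
import numpy as np

def vif_from_correlation(R):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.asarray(R, dtype=float)))

def two_predictor_corr(rho):
    """Correlation matrix of two predictors with correlation rho (illustrative helper)."""
    return np.array([[1.0, rho], [rho, 1.0]])

vif_uncorrelated = vif_from_correlation(two_predictor_corr(0.0))
vif_correlated = vif_from_correlation(two_predictor_corr(0.89))
```

With two predictors the VIF reduces to $1/(1-\rho^2)$, so uncorrelated predictors give a VIF of one, and the coefficient variance grows without bound as $|\rho|$ approaches one.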

Diagnosing Multicollinearity
Several classic symptoms of multicollinearity are present in our data:
• The overall F statistic is highly significant (p-value = 0.000), implying that the chosen variables are valid explanatory variables (Chatterjee and Price, 1977, p. 146), yet most of the individual regression coefficients are insignificant at the 5% level of significance, as can be seen from Table 2.
• The value of $R^2$ is quite large, i.e., 0.899.
• The variance inflation factors (VIF) of HDI and the female literacy rate are greater than 10, i.e., 10.337 and 11.019 respectively.
• Eigenvalues, condition indices and the condition number can also be used in examining multicollinearity. The condition number ($k$) is the square root of the largest eigenvalue ($\max(\lambda)$) divided by the smallest eigenvalue ($\min(\lambda)$), i.e., $k = \sqrt{\max(\lambda)/\min(\lambda)}$. In our case, $k = 11.406$.
When there is no collinearity at all, the eigenvalues, condition indices and condition number all equal one. As collinearity increases, eigenvalues will be both greater and smaller than 1 (eigenvalues close to zero indicate a multicollinearity problem), and the condition indices and the condition number will increase. Table 3 shows how the explanatory variables are correlated. Among the explanatory variables, HDI and IMR are highly negatively correlated (−0.89), the correlation between HDI and female age at marriage is −0.74, and there is a high positive correlation between the female literacy rate and HDI (0.89); the correlations between all other predictor variables can be read in the same way. A correlation measures the relationship between two variables. A positive correlation is a direct relationship: as one variable increases, the other also increases. In a negative correlation, as one variable goes up, the other goes down. Another way to check for multicollinearity is to compare the sum of the reciprocals of the eigenvalues with five times the number of predictor variables: if the sum is greater, there is multicollinearity in the data. In this data, the sum of the reciprocals of the eigenvalues is 64.46, which is greater than five times the number of predictor variables (10) used [47]. All these indicate the presence of multicollinearity. In the presence of multicollinearity, the estimates obtained by the OLS estimator are not reliable if we want to know how the predictor variables ($X$) affect the response variable ($Y$).
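The two eigenvalue-based diagnostics above can be checked numerically. The sketch below takes as input the ten eigenvalues of the predictor correlation matrix that the paper reports later in this section; the variable names are illustrative.

```python
import numpy as np

# Eigenvalues of the correlation matrix of the ten predictors, as reported in the paper.
eigenvalues = np.array([5.9718, 1.5576, 1.0881, 0.4742, 0.2966,
                        0.2705, 0.1258, 0.0929, 0.0766, 0.0459])

# Condition number: square root of the largest over the smallest eigenvalue.
condition_number = np.sqrt(eigenvalues.max() / eigenvalues.min())

# Condition indices: the same ratio taken against each eigenvalue in turn.
condition_indices = np.sqrt(eigenvalues.max() / eigenvalues)

# Rule of thumb from the text: compare the reciprocal sum with 5 * (number of predictors).
reciprocal_sum = (1.0 / eigenvalues).sum()
threshold = 5 * eigenvalues.size

print(round(condition_number, 3), round(reciprocal_sum, 2), threshold)
```

Both quantities reproduce the values quoted in the text (a condition number of about 11.406 and a reciprocal sum of about 64.46, against a threshold of 50).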
Several methods have been suggested to solve this problem. Ridge regression (RR) and principal component regression (PCR) are two of the most popular biased regression methods that help to address the problem of collinearity in the data and provide a better solution.
1. Ridge Regression (RR): Hoerl and Kennard (1970a) suggested the use of $X'X + kI_p$ ($k \ge 0$) rather than $X'X$ in the estimation of $\beta$ in (2.5). The resulting estimator of $\beta$ is known in the literature as the RR estimator, given by
$\hat{\beta}(k) = (X'X + kI_p)^{-1} X'Y.$ (2.6)
The constant $k$ is known as the biasing or ridge parameter. As $k$ increases from zero towards infinity, the regression estimates tend toward zero. Although these estimators are biased, for certain values of $k$ they yield a smaller MSE than the LS estimator (Hoerl and Kennard, 1970a). However, $\mathrm{MSE}(\hat{\beta}(k))$ depends on the unknown parameters $k$, $\beta$ and $\sigma^2$, and so cannot be calculated in practice; $k$ has to be estimated from the real data instead. Several methods for estimating $k$ have been proposed and evaluated by several researchers. Some of these researchers are Hoerl [...]
[...] Jolliffe (1986), Jackson (1991) and Basilevsky (1994). Other reviews are by Rao (1964), Jackson (1980; 1981), Wold et al. (1987), Dunteman (1989), Rencher (1998) and Jolliffe (2005). As we have indicated, one approach to the problem of multicollinearity is PCR, in which $Y$ is regressed on the principal components of the $X$'s. If we use only the larger principal components, the large variances in the $\hat{\beta}_j$'s due to multicollinearity are reduced, but of course we introduce some bias in the new $\hat{\beta}_j$'s. Often, the principal components with the highest variance are selected. However, the low-variance principal components may also be important, and in some cases they may be even more important than those with the highest variances (Jolliffe, 1982).
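The shrinkage behaviour of the ridge estimator (2.6) can be sketched on synthetic, deliberately collinear data. The function name `ridge`, the grid of $k$ values and the data are assumptions made for this illustration; they are not taken from the paper.

```python
import numpy as np

def ridge(X, y, k):
    """Ridge estimate (X'X + k I)^{-1} X'y for a biasing parameter k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Synthetic predictors: the second column is nearly a copy of the first.
rng = np.random.default_rng(2)
x1 = rng.normal(size=29)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=29), rng.normal(size=29)])
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))   # unit-length columns, so X'X is a correlation matrix
y = rng.normal(size=29)
y = y - y.mean()

ks = [0.0, 0.1, 1.0, 10.0, 1000.0]      # illustrative grid of ridge parameters
norms = [np.linalg.norm(ridge(X, y, k)) for k in ks]
```

At $k = 0$ the function returns the LS estimate, and the length of the coefficient vector in `norms` decreases as $k$ grows, which is the "tends toward zero" behaviour described above.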
Let $T = (t_1, t_2, \ldots, t_p)$ be the $(p \times p)$ orthogonal matrix that diagonalizes $X'X$, i.e., $T'X'XT = \Lambda = \mathrm{diag}(e_1, e_2, \ldots, e_p)$, where $T'T = I_p = TT'$, $\Lambda$ being the diagonal matrix with the eigenvalues of $X'X$ as its diagonal elements.
Equation (2.4) can be written as [...] and, on pre-multiplying both sides by $T$, we have [...] (2.9). After deleting $(p − r)$ columns of $T$, let $T_r$ contain the remaining eigenvectors of $X'X$, so that $T_r'X'XT_r = \Lambda_r$; then, from (2.9), the reduced model will be [...], where $T_r$ is the $p \times r$ matrix of retained eigenvectors. The purpose of principal components is to generate a reduced set of variables that account for most of the variance of the original variables. We must therefore decide how many components to retain; the other components are discarded. In reality, the number of components extracted in a principal component analysis is equal to the number of observed variables being analysed. However, Mansfield et al. (1977) suggested that only the first few components account for meaningful amounts of variance, so only these first few components are retained and used in multiple regression analyses. Jolliffe (1982) represents the point of view of many statisticians whose decisions depend only on the magnitude $\lambda$ of the variance of the principal component.
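A hedged sketch of principal component regression follows. It assumes the usual construction, in which $Y$ is regressed on the scores $Z_r = XT_r$ of the $r$ retained eigenvectors and the result is mapped back to the original coordinates; this may differ in notation from the paper's equations (2.9) and (2.10), which are not recoverable here. The function name `pcr` and the data are illustrative.

```python
import numpy as np

def pcr(X, y, r):
    """Principal component regression keeping the r components with the largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:r]        # indices of the r largest eigenvalues
    T_r = eigvecs[:, order]                      # p x r matrix of retained eigenvectors
    Z_r = X @ T_r                                # component scores; Z_r'Z_r = diag(eigvals[order])
    alpha_r = (Z_r.T @ y) / eigvals[order]       # least squares on orthogonal scores
    return T_r @ alpha_r                         # back to the original coefficients

# Synthetic standardized data, for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(29, 5))
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))
y = rng.normal(size=29)
y = y - y.mean()

beta_full = pcr(X, y, r=5)   # keeping every component reproduces least squares
beta_two = pcr(X, y, r=2)    # keeping two components restricts the estimate to their span
```

Retaining all $p$ components gives back the LS estimator, so the bias and the variance reduction both come entirely from the discarded directions.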
The eigenvalues of the correlation matrix of the predictor variables have also been calculated. They are $\lambda_1 = 5.9718$, $\lambda_2 = 1.5576$, $\lambda_3 = 1.0881$, $\lambda_4 = 0.4742$, $\lambda_5 = 0.2966$, $\lambda_6 = 0.2705$, $\lambda_7 = 0.1258$, $\lambda_8 = 0.0929$, $\lambda_9 = 0.0766$ and $\lambda_{10} = 0.0459$. The figure shows the percent of variation explained by the principal components. From this figure we can conclude that the first principal component explains 59.7% of the variation, the second principal component explains 15.58%, and so on. The first four principal components together explain 90.91% of the variation, so we consider only the first four principal components in our model. The first four eigenvalues are $\lambda_1 = 5.9718$, $\lambda_2 = 1.5576$, $\lambda_3 = 1.0881$ and $\lambda_4 = 0.4742$, and the corresponding eigenvectors are collected in the matrix $T_r$. Using the geometric mean method given in (2.8), the value of $k$ is calculated as $k = 2.723501009$.
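The share of variation attributed to each component follows directly from the reported eigenvalues, since the eigenvalues of a 10 × 10 correlation matrix sum to 10. The sketch below recomputes those shares; the 90% cut-off used to pick the number of components is an assumption made here for illustration, chosen because it matches the paper's decision to keep four components.

```python
import numpy as np

# Eigenvalues of the predictor correlation matrix, as reported above.
eigenvalues = np.array([5.9718, 1.5576, 1.0881, 0.4742, 0.2966,
                        0.2705, 0.1258, 0.0929, 0.0766, 0.0459])

explained = eigenvalues / eigenvalues.sum()   # share of total variation per component
cumulative = np.cumsum(explained)             # running total over the leading components

# Smallest number of components whose cumulative share reaches the assumed 90% cut-off.
r = int(np.searchsorted(cumulative, 0.90) + 1)
```

Under this cut-off `r` equals four, with the leading four components accounting for roughly 90.9% of the variation.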
PCR was first proposed by Hotelling (1957) and Kendall (1957). Hsuan (1981) explored the relationship between PCR and RR. He proved that when the data are severely multicollinear, the ridge estimator can be made very close to the principal components estimator. Baye and Parker (1984) and Nomura and Ohkubo (1985) proposed the r-k class estimator by combining the RR estimator and the PCR estimator into a single estimator, which performs better than the other estimators when dealing with multicollinearity (Sarkar, 1989). The r-k class estimator can be written in the form
$\hat{\beta}_r(k) = T_r (T_r'X'XT_r + kI)^{-1} T_r'X'Y,$ (2.11)
where $T_r$ is the matrix of eigenvectors, $X$ and $Y$ are the standardized matrices of explanatory and response variables respectively, and $I$ is the $(r \times r)$ identity matrix. From the r-k class estimator given in (2.11), we can easily estimate the regression coefficients, which are given in Table 4 and in the fitted model (2.12).
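A minimal sketch of the r-k class estimator is given below, assuming the commonly cited form $\hat{\beta}_r(k) = T_r (T_r'X'XT_r + kI_r)^{-1} T_r'X'Y$, which is consistent with the symbols defined in the text. The data are synthetic and the function name `rk_class` is illustrative; the paper's own values ($r = 4$, $k \approx 2.7235$) would be used with the NFHS-III data, which are not reproduced here.

```python
import numpy as np

def rk_class(X, y, r, k):
    """r-k class estimate: ridge regression restricted to the r leading eigenvectors of X'X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:r]        # indices of the r largest eigenvalues
    T_r = eigvecs[:, order]                      # p x r matrix of retained eigenvectors
    A = T_r.T @ X.T @ X @ T_r + k * np.eye(r)    # r x r matrix to invert
    return T_r @ np.linalg.solve(A, T_r.T @ X.T @ y)

# Synthetic standardized data, for illustration only.
rng = np.random.default_rng(4)
X = rng.normal(size=(29, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=29)    # induce collinearity between two columns
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))
y = rng.normal(size=29)
y = y - y.mean()

beta_rk = rk_class(X, y, r=2, k=0.5)
```

The special cases show how the estimator nests the others: with $r = p$ and $k = 0$ it is the LS estimator, with $r = p$ and $k > 0$ it is the ridge estimator, and with $r < p$ and $k = 0$ it is the PCR estimator.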

Conclusion
India, being a developing country, has to face several socio-demographic challenges. One of the most important problems is the population explosion, or the high birth rate. There are many problems associated with a high birth rate. High birth rates can put stress on government welfare and family programs that support a youthful population. Additional problems faced by a country with a high birth rate include educating a growing number of children, creating jobs for these children when they enter the workforce, and dealing with the environmental effects that a large population can produce. Several solutions to decrease the rate of population increase have been tried by the government of India, some successfully, some unsuccessfully. Although the rate of increase has fallen to some extent, it has not yet reached a satisfactory level. The population of India continues to increase at an alarming rate. The effects of this population increase are evident in increasing poverty, unemployment, air and water pollution, and shortages of food, health resources and educational resources. It is therefore important to analyse the determinants of fertility in India and to identify their relative weights, which are necessary for setting priorities when formulating population policies.
In order to evaluate the relative importance of the explanatory variables in determining the total fertility rate, standardized variables have been used. The resulting regression model given in (2.12) supports the conclusion that the use of contraceptive devices (any method) is the factor with the highest impact in decreasing the total fertility rate. Family planning policies and programs are the most important contributor to the reduction of the fertility rate. This lends support to the contention that the recent decline in fertility in India has been fostered mainly by the government's family planning programs.
The other variable significantly related to the total fertility rate is maternal care. A healthy, relaxed mother is more likely to have a positive effect on the well-being of the newborn. If the mother receives no care, the child is likely to be very weak and the chances of infant mortality will be higher. The resulting need for children leads to a higher fertility rate. Many studies have obtained results supportive of the positive effect of infant and child mortality on fertility (Adlakha, 1973; Taylor, Newman and Kelly, 1976). The idea is conceptually related to the child survival hypothesis. Experience with, or fear of, infant and child mortality might make married couples have extra births to replace young children who have already died. As such, societies with higher infant mortality tend to have higher fertility. Thus, the overall means of reducing the fertility rate is to improve mother and child health.
In a developing country like India, the use of improved water plays a very important role in human fertility. In our case, its role appears vital in determining the TFR for women of reproductive age. We see that TFR decreases with the use of improved water slightly more than with the other negative factors such as female age at marriage, birth interval and female literacy rate. It is therefore quite interesting to analyse the role of the use of improved water in human fertility, which is a very important factor for a civilized society.
The fourth important variable known to influence the fertility performance of women is the female age at marriage, in the sense that if the female age at marriage is low, women start having their children at an early age, and these children, in their turn, begin to procreate early. By raising the age at marriage, especially for women, we cut down on their reproductive span and thus reduce fertility.
The role of education is widely believed to be central to major changes in the fertility rate in India and elsewhere. Generally, a higher level of education is associated with later and less childbearing, although higher-educated women are more likely to have higher-earning husbands or partners, which provides a further positive "income effect" on childbearing. Education also provides an opportunity to participate in gainful employment outside the home, and this competes with the demands of childbearing. Better educated women enjoy better access to the opportunities of life, and hence lower fertility is more advantageous to them than higher fertility, since with lower fertility it is easier to reap the benefits of those opportunities. Thus an educated woman is very likely to prefer a smaller family. Among women with no education, even a significant difference in the number of children fails to make any observable difference in the level of living. As such, societies with lower levels of literacy are more likely to have higher fertility rates. Education exposes a woman to a wide range of information regarding birth control and family planning and so decreases the total fertility rate.
Birth interval also affects fertility. The model indicates that birth interval is another factor that decreases the total fertility rate, but only to a small degree. This indicates that in India the spacing between two births is still low.
All these factors have implications for fertility performance. Thus, although the family planning programs have played the most important role in declining fertility, this decline should not be viewed as due solely to successful family planning programs. The results of this analysis indicate that an egalitarian distribution of the benefits of socio-economic development over rural and urban areas, maternal care, an increase in the level of female literacy, a decrease in the level of infant mortality, use of improved water for drinking, age at marriage and spacing between births may be important strategies for reducing fertility rates in India. However, raising the age at marriage will have an impact on fertility only when the law relating to it is uniformly enforced throughout the country.

Table 1: Means and standard deviations of total fertility rate and ten predictor variables: 29 states of India

Table 3: Correlation matrix of predictor variables

the eigenvalues of the correlation matrix of explanatory variables and $e_1, e_2, \ldots, e_p$ are orthogonal eigenvectors corresponding to the eigenvalues. Orthogonal means