Influence Diagnostics in Two-Parameter Ridge Regression

Abstract: Identifying influential observations is an important part of the model-building process in linear regression. There are numerous diagnostic measures based on different approaches in linear regression analysis. However, multicollinearity and influential observations may occur simultaneously. Therefore, we propose new diagnostic measures based on the two-parameter ridge estimator defined by Lipovetsky and Conklin (2005) as an alternative to the usual ridge regression and ordinary linear regression. We define two-parameter ridge-type generalizations of DFFITS and Cook's distance. Moreover, we obtain approximate case deletion formulas and provide approximate versions of the new measures. Finally, we illustrate the benefits of the proposed measures in real data examples.


Introduction and Motivation
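Consider the usual multiple linear regression model with intercept, which is defined as

$$y = \beta_0 \mathbf{1} + X\beta_1 + \varepsilon, \qquad (1.1)$$

where $X$ is an $n \times t$ centered and standardized data matrix, $y$ is an $n \times 1$ response vector, $\beta_0$ is an unknown scalar parameter, $\beta_1$ is a $t \times 1$ vector of unknown coefficients, and $\varepsilon$ is an $n \times 1$ random error vector following the normal distribution $N(0, \sigma^2 I_n)$. Writing $Z = [\mathbf{1}, X]$ and $\beta = (\beta_0, \beta_1')'$, the ordinary least squares (OLS) estimator is $\hat\beta = (Z'Z)^{-1}Z'y$, $e = y - Z\hat\beta = (I - H)y$ is the residual vector, and $H = Z(Z'Z)^{-1}Z'$ is the hat or projection matrix having diagonal elements $h_{ii} = z_i'(Z'Z)^{-1}z_i$, where $z_i'$ is the $i$th row of $Z$. We refer to a couple $(y_i, z_i')$ as a case, as suggested by Cook and Weisberg (1982).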
Not all data points in a data set have the same importance in determining the coefficient estimates, t-values and other statistics. Some points may affect the analysis or the estimates remarkably. It is therefore crucial to detect such points in the analysis. This process is called influence diagnostics.
However, influential observations and multicollinearity may occur at the same time. Belsley et al. (1980) stated that using biased estimators to overcome multicollinearity may affect the influence of some cases. The most common method of detecting influential observations is the single-case deletion approach used by Cook (1977). Accordingly, Walker and Birch (1988) defined a ridge regression (Hoerl and Kennard, 1970) scheme with the case deletion method, obtained approximate case deletion formulas for the detection of influential cases, and proposed ridge generalizations of Cook's distance (Cook, 1977) and DFFITS (Belsley et al., 1980), which are the most commonly used case-deletion statistics in ordinary linear regression.
After Walker and Birch (1988), generalized versions of Cook's distance and DFFITS have been defined for several biased estimators used for combating multicollinearity, for example by Jahufer and Jianbao (2009). The purpose of this paper is to introduce new influence diagnostics based on the two-parameter ridge estimator defined by Lipovetsky and Conklin (2005), and to obtain generalized versions of Cook's distance and DFFITS together with approximate case deletion formulas for this estimator.
The paper is organized as follows. We give some brief background information and review the influence measures in ordinary linear regression in section 2. In section 3, we introduce new diagnostic measures in two-parameter ridge regression and obtain case deletion formulas. Applications to real data sets are illustrated in section 4.

Background Information
The main purpose of influence analysis is to measure the changes that occur in a defined aspect of the analysis when the data are perturbed. As mentioned above, one approach is the case omission perturbation technique. We follow this technique throughout this article and assume that the reader is familiar with the basic concepts of leverages and influence analysis in ordinary least squares.
Although there are various types of single-case diagnostic methods, one popular method is the standardized difference in fit, called DFFITS (Belsley et al., 1980), which is the standardized change in the fitted value of a case when it is deleted. It can be evaluated at the $i$th case as

$$DFFITS_i = \frac{z_i'(\hat\beta - \hat\beta_{(i)})}{s_{(i)}\sqrt{h_{ii}}} = \frac{e_i\sqrt{h_{ii}}}{s_{(i)}(1 - h_{ii})},$$

where $\hat\beta_{(i)}$ is the OLS estimator of $\beta$ when the $i$th case is deleted and $s_{(i)}$ is the OLS estimator of $\sigma$ without the $i$th case.
Another popular and useful measure is Cook's distance (Cook and Weisberg, 1982), which measures the change in the fitted values when the $i$th case is deleted and is defined by

$$D_i = \frac{(\hat\beta - \hat\beta_{(i)})'Z'Z(\hat\beta - \hat\beta_{(i)})}{p\,s^2} = \frac{e_i^2}{p\,s^2}\,\frac{h_{ii}}{(1 - h_{ii})^2},$$

where $p = t + 1$ is the number of regression parameters and $s^2 = e'e/(n - p)$. It is observed from the above measures that the influence of a case can be interpreted as a function of residuals and leverages. Moreover, it is important to emphasize that these measures are useful for exploring individual, or single, influential cases. Shi and Wang (1999) stated that measures based on the case deletion method may suffer from the masking problem, which occurs in the presence of other influential cases. $D_i$ detects the case causing the largest change in the estimates when it is deleted, whereas $DFFITS_i$ also accounts for the effect on the variance estimate $s^2$ (Brown and Lawrance, 2000).
If the values of $D_i$ and $DFFITS_i$ exceed some well-defined cutoff points, the $i$th observation is said to be influential. However, the cutoff points for these measures are not clear; suggestions are given by Cook and Weisberg (1982) and Belsley et al. (1980). It is also important to note that these influence measures are only useful for identifying single cases with high influence.
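The leverage-and-residual forms of the OLS measures can be checked numerically against literal case deletion. The following is a minimal sketch on synthetic data, assuming the standard OLS definitions of DFFITS and Cook's distance; the function names `ols_influence` and `brute_force` and the simulated data are illustrative and do not come from the paper.

```python
import numpy as np

def ols_influence(Z, y):
    """DFFITS and Cook's distance for every case, from residuals and leverages."""
    n, p = Z.shape
    G = np.linalg.inv(Z.T @ Z)
    h = np.einsum("ij,jk,ik->i", Z, G, Z)      # leverages h_ii = z_i' (Z'Z)^{-1} z_i
    beta = G @ Z.T @ y
    e = y - Z @ beta                           # OLS residuals
    s2 = e @ e / (n - p)                       # full-sample variance estimate
    # Variance estimate with case i deleted (standard updating identity).
    s2_del = ((n - p) * s2 - e**2 / (1.0 - h)) / (n - p - 1)
    dffits = e * np.sqrt(h) / (np.sqrt(s2_del) * (1.0 - h))
    cooks = e**2 * h / (p * s2 * (1.0 - h) ** 2)
    return dffits, cooks

def brute_force(Z, y, i):
    """Same two measures for case i, obtained by actually refitting without it."""
    n, p = Z.shape
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    beta_i = np.linalg.lstsq(Z[keep], y[keep], rcond=None)[0]
    d = beta - beta_i
    e = y - Z @ beta
    s2 = e @ e / (n - p)
    e_del = y[keep] - Z[keep] @ beta_i
    s_del = np.sqrt(e_del @ e_del / (n - 1 - p))
    h_ii = Z[i] @ np.linalg.solve(Z.T @ Z, Z[i])
    dffits_i = Z[i] @ d / (s_del * np.sqrt(h_ii))
    cooks_i = d @ (Z.T @ Z) @ d / (p * s2)
    return dffits_i, cooks_i

# Synthetic data (illustrative only): intercept plus three regressors.
rng = np.random.default_rng(0)
n, t = 20, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, t))])
y = Z @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

dffits, cooks = ols_influence(Z, y)
```

The closed forms avoid refitting the model n times, which is why both measures are usually reported as functions of residuals and leverages.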

Two-Parameter Ridge Estimator (TPR)
When the explanatory variables are correlated with each other, the variance of the unbiased OLS estimator becomes inflated, so that stable estimates cannot be obtained. Various biased estimators have therefore been proposed in the literature, among them the ridge estimator (RE) of Hoerl and Kennard (1970) and the Liu estimator (Kejian, 1993). In this study, we consider the two-parameter ridge estimator (TPR) defined by Lipovetsky and Conklin (2005). Although RE is a popular estimator, its quality of fit is worse than that of OLS and it does not satisfy the orthogonality assumptions. Lipovetsky and Conklin (2005) therefore obtained TPR, a generalization of RE to a two-parameter model, by simultaneously minimizing the model errors, the deviations from orthogonality between regressors and errors, and the deviations of the solutions from the pairwise regressions. Now, let us denote the objective function of the sum of squared errors of OLS as

$$S^2 = \|\varepsilon\|^2 = (y - Z\beta)'(y - Z\beta). \qquad (3.1)$$

Minimizing (3.1) leads to the normal system of equations (3.2), and $R^2$ in (3.3) estimates the quality of the model.
We can describe the relation (3.2) as

$$Z'(y - Z\beta) = Z'\varepsilon = 0, \qquad (3.4)$$

expressing the orthogonality of each regressor (column of $Z$) to the error vector. Similarly, we can obtain the objective function of RE as

$$S^2 = \|y - Z\beta\|^2 + k\|\beta\|^2. \qquad (3.5)$$

Minimizing (3.5) gives RE as

$$\hat\beta(k) = (Z'Z + kI)^{-1}Z'y.$$

Since $Z'(y - Z\hat\beta(k)) = k\hat\beta(k)$, which is nonzero for $k > 0$, we can conclude that RE does not satisfy the orthogonality assumption (3.4). Therefore Lipovetsky and Conklin (2005) constructed a multi-objective least squares criterion for regression, given in (3.6), whose minimization yields the matrix equation (3.7).
Under the parameter restriction adopted by Lipovetsky and Conklin (2005) and after some algebraic calculations (see that paper for details), TPR is obtained as

$$\hat\beta(k, q) = q\,(Z'Z + kI)^{-1}Z'y, \qquad (3.8)$$

where the parameter $q$ is chosen to maximize the regression fit function (3.9), which can be obtained by using equations (3.8) and (3.3). The optimal value of the parameter $q$ is computed by

$$\hat q = \frac{(Z'y)'(Z'Z + kI)^{-1}Z'y}{(Z'y)'(Z'Z + kI)^{-1}Z'Z(Z'Z + kI)^{-1}Z'y}, \qquad (3.10)$$

which is always bigger than 1 for $k > 0$. The authors also claimed that all the orthogonality assumptions hold for TPR.
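As a rough illustration of how the two-parameter estimator behaves, the sketch below computes a ridge solution, rescales it by the ratio q described by Lipovetsky and Conklin (2005), and checks three consequences of that choice. It assumes the form q (Z'Z + kI)^{-1} Z'y with a full identity penalty (the numerical section of the paper penalizes only the slopes); the function name `tpr_fit`, the value k = 0.1 and the simulated collinear data are assumptions made for the example.

```python
import numpy as np

def tpr_fit(Z, y, k):
    """Two-parameter ridge fit: returns (q * ridge solution, q, ridge solution)."""
    p = Z.shape[1]
    C = Z.T @ Z
    r = Z.T @ y
    a = np.linalg.solve(C + k * np.eye(p), r)   # ordinary ridge estimate
    q = (r @ a) / (a @ C @ a)                   # ratio that maximizes the fit along a
    return q * a, q, a

# Synthetic, nearly collinear regressors (illustrative only).
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n), rng.normal(size=n)])
X = X - X.mean(axis=0)
X = X / np.sqrt((X**2).sum(axis=0))             # columns scaled so X'X is a correlation matrix
Z = np.column_stack([np.ones(n), X])
y = 3.0 + X @ np.array([2.0, 1.0, -1.0]) + 0.1 * rng.normal(size=n)

beta_tpr, q_hat, beta_ridge = tpr_fit(Z, y, k=0.1)
fitted = Z @ beta_tpr
resid = y - fitted
sse_tpr = resid @ resid
sse_ridge = np.sum((y - Z @ beta_ridge) ** 2)
```

With this q, the residual vector is orthogonal to the fitted values and the residual sum of squares is no larger than that of the plain ridge fit with the same k, which is one way to read the claim that the second parameter restores quality of fit.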

Leverage and Residuals in TPR
Using TPR given in (3.8), we can obtain the vector of fitted values as $\hat y(k, q) = Z\hat\beta(k, q) = H(k, q)\,y$, where $H(k, q) = q\,Z(Z'Z + kI)^{-1}Z'$ is the hat matrix for TPR and plays the same role as the hat matrix $H$ of ordinary least squares. We can interpret the $i$th fitted value in terms of the elements of $H(k, q)$ as $\hat y_i(k, q) = \sum_{j=1}^{n} h_{ij}(k, q)\,y_j$. We can regard the diagonal elements $h_{ii}(k, q)$ as the leverages for TPR, as in least squares. Note that the matrix $H(k, q)$ is not idempotent; thus it is not a projection matrix.
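Furthermore, we can consider the canonical reduction by applying the singular value decomposition (SVD) (Mandel, 1982), $Z = U\Lambda^{1/2}V'$, where $\Lambda$ is a diagonal matrix consisting of the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{t+1}$ of $Z'Z$, the columns of $V$ are the corresponding eigenvectors, and $u_{ij} = z_i'v_j/\sqrt{\lambda_j}$ is the scaled projection of the $i$th row $z_i$ onto the $j$th eigenvector. By using the SVD, the $i$th leverage of TPR can be written as

$$h_{ii}(k, q) = q\sum_{j=1}^{t+1}\frac{\lambda_j}{\lambda_j + k}\,u_{ij}^2.$$

We observe from the above result that if $q$ approaches one when $k$ is fixed, then $h_{ii}(k, q)$ approaches the ridge leverage $h_{ii}(k)$, which satisfies

$$\frac{\lambda_{\min}}{\lambda_{\min} + k}\,h_{ii} \le h_{ii}(k) \le \frac{\lambda_{\max}}{\lambda_{\max} + k}\,h_{ii},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of $Z'Z$. The $i$th residual of TPR is also given by

$$e_i(k, q) = y_i - \hat y_i(k, q) = y_i - z_i'\hat\beta(k, q) = y_i - \sum_{j=1}^{n} h_{ij}(k, q)\,y_j.$$

We reach similar conclusions: if $q$ approaches one when $k$ is fixed, then $e_i(k, q)$ approaches the ridge residual $e_i(k)$. If $k$ approaches zero when $q = 1$, then $e_i(k, q)$ tends to the OLS residual $e_i$.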
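The leverage expression and the limiting cases can be verified numerically. This is a small sketch assuming the TPR hat matrix has the form q Z (Z'Z + kI)^{-1} Z' with a full identity penalty; the helper names `tpr_hat` and `tpr_leverage_svd`, the values k = 0.5 and q = 1.2, and the random data are illustrative.

```python
import numpy as np

def tpr_hat(Z, k, q):
    """Hat matrix of the two-parameter ridge fit (assumed form)."""
    p = Z.shape[1]
    return q * Z @ np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T)

def tpr_leverage_svd(Z, k, q):
    """Diagonal of the hat matrix computed from the SVD of Z."""
    U, d, _ = np.linalg.svd(Z, full_matrices=False)
    lam = d**2                                   # eigenvalues of Z'Z
    return q * (U**2) @ (lam / (lam + k))

# Synthetic data (illustrative only).
rng = np.random.default_rng(2)
n = 25
X = rng.normal(size=(n, 3))
X = X - X.mean(axis=0)
X = X / np.sqrt((X**2).sum(axis=0))
Z = np.column_stack([np.ones(n), X])
y = rng.normal(size=n)

k, q = 0.5, 1.2
H = tpr_hat(Z, k, q)
lev = tpr_leverage_svd(Z, k, q)
resid = y - H @ y                                # TPR residuals e_i(k, q)

H_ols = tpr_hat(Z, 0.0, 1.0)                     # k = 0, q = 1 recovers the OLS projection
h_ols = np.diag(H_ols)
h_ridge = tpr_leverage_svd(Z, k, 1.0)            # q = 1 recovers the ridge leverages
lam = np.linalg.eigvalsh(Z.T @ Z)
lower = lam.min() / (lam.min() + k) * h_ols
upper = lam.max() / (lam.max() + k) * h_ols
```

Here `H @ H` differs from `H` whenever k > 0, while `H_ols` is idempotent, matching the remark that the TPR hat matrix is not a projection.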

Cook's Distance and DFFITS in TPR
We define new versions of DFFITS and Cook's distance for TPR in (3.11)-(3.13), based on the difference between $\hat\beta(k, q)$ and $\hat\beta_{(i)}(k, q)$, the TPR estimator computed without the $i$th case. It would be preferable to express these new measures as functions of leverages and residuals. However, this is not possible because of the scale dependency of the TPR estimator. Since the TPR estimator is not scale invariant, $Z_{(i)}$, the $Z$ matrix without the $i$th row, needs to be rescaled before computing $\hat\beta_{(i)}(k, q)$. In the following subsection, we provide some approximate case deletion formulas to obtain approximate versions of these measures.

Approximate Case Deletion Formulas for TPR
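Let $\hat\beta_{(i)}(k, q) = q\,(Z_{(i)}'Z_{(i)} + kI)^{-1}Z_{(i)}'y_{(i)}$, where $\hat\beta_{(i)}(k, q)$ is the TPR estimator without the $i$th case and $y_{(i)}$ is the response vector without the $i$th element. We can write $\hat\beta_{(i)}(k, q)$ in the following form:

$$\hat\beta_{(i)}(k, q) = q\,(Z'Z + kI - z_iz_i')^{-1}(Z'y - z_iy_i).$$

Now, we apply the Sherman-Morrison-Woodbury (SMW) theorem (Rao, 1973) to the matrix $(Z'Z + kI - z_iz_i')^{-1}$ and obtain an explicit expression for $\hat\beta_{(i)}(k, q)$, from which a formula for the difference $\hat\beta(k, q) - \hat\beta_{(i)}(k, q)$ follows. Based on this result, we present the approximate versions of (3.11), (3.12) and (3.13), respectively.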
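To make the rank-one update concrete, here is a hedged sketch of one way the SMW step can be carried out. It assumes that k and q are held fixed when a case is dropped, that the remaining rows of Z are not re-centered or rescaled (which is what makes the result approximate in practice), and that the penalty is a full identity matrix. Under those assumptions a direct derivation gives a difference of (Z'Z + kI)^{-1} z_i (q y_i - z_i' beta(k, q)) / (1 - z_i'(Z'Z + kI)^{-1} z_i); this is a derivation made for this example and may be written differently in the paper. The function names and data are illustrative.

```python
import numpy as np

def tpr_beta(Z, y, k, q):
    """Two-parameter ridge estimate for fixed k and q (assumed form)."""
    p = Z.shape[1]
    return q * np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ y)

def tpr_deleted_sm(Z, y, k, q, i):
    """Estimate without case i via the Sherman-Morrison rank-one update."""
    p = Z.shape[1]
    A_inv = np.linalg.inv(Z.T @ Z + k * np.eye(p))
    z = Z[i]
    beta = q * A_inv @ Z.T @ y
    h_star = z @ A_inv @ z                       # equals h_ii(k, q) / q
    diff = A_inv @ z * (q * y[i] - z @ beta) / (1.0 - h_star)
    return beta - diff

# Synthetic data (illustrative only).
rng = np.random.default_rng(3)
n = 18
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = rng.normal(size=n)
k, q = 0.3, 1.1

i = 5
keep = np.arange(n) != i
beta_update = tpr_deleted_sm(Z, y, k, q, i)
beta_refit = tpr_beta(Z[keep], y[keep], k, q)    # literal deletion, same k and q
```

The update needs only one matrix inverse for all n cases, whereas literal deletion refits the model n times; any gap between the two in an application comes from re-standardizing the data and re-estimating k and q after deletion.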

Numerical Examples
In this section, we illustrate an application of the new influence statistics to the widely investigated data set used by Longley (1967). There are 16 observations on the response variable, total derived employment, and 6 predictors, namely the GNP implicit price deflator ($x_1$), gross national product ($x_2$), unemployment ($x_3$), size of armed forces ($x_4$), non-institutional population 14 years of age and over ($x_5$), and time ($x_6$). This data set has been used to identify influential observations by Cook (1977), Walker and Birch (1988), Jahufer and Jianbao (2009), Ullah et al. (2013) and other authors. To be consistent with these papers, we use the model (1.1), where $\mathbf{1}$ is a vector of 16 ones and $X$ is centered and standardized so that $X'X$ is in correlation form. We use the matrix $Z = [\mathbf{1}, X]$ as the design matrix. Thus, we use $\mathrm{diag}(0, 1, 1, \dots, 1)$ in place of the identity matrix, as in Walker and Birch (1988).
We used Matlab to compute all of the reported quantities, so there may be some differences between our results and the literature. The condition number of the matrix $Z$ is computed as $\lambda_{\max}/\lambda_{\min} = 42473$, which shows that there is a strong multicollinearity problem in this data set. Cook (1977) considered this data set and found the cases [...] for different values of the parameters $k$ and $q$, and we identify the most influential observations as given in Table 1. We provide the values regarding the observations whose [...] (Kibria, 2003).
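For readers who want to reproduce this kind of multicollinearity check on their own data, the sketch below standardizes the regressors into correlation form and computes the eigenvalue ratio of Z'Z. It is an illustration on simulated data, not the Longley data, and it assumes that the reported condition number is the ratio of the largest to the smallest eigenvalue of Z'Z; the helper names are hypothetical.

```python
import numpy as np

def correlation_form(X):
    """Center each column and scale it to unit length, so X'X is a correlation matrix."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc**2).sum(axis=0))

def condition_number(Z):
    """Ratio of the largest to the smallest eigenvalue of Z'Z."""
    lam = np.linalg.eigvalsh(Z.T @ Z)            # eigenvalues in ascending order
    return lam[-1] / lam[0]

# Simulated regressors with two nearly identical columns (illustrative only).
rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
X = correlation_form(np.column_stack([x1, x1 + 0.01 * rng.normal(size=n),
                                      rng.normal(size=n)]))
Z = np.column_stack([np.ones(n), X])
cond = condition_number(Z)
```

A large ratio signals that at least one direction in the regressor space is almost unidentified, which is the situation in which ridge-type estimators and their diagnostics are of interest.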
According to Table 1, the same observations that Cook (1977) identified as influential cases are detected as influential, but in a different order, namely (16, 10, 5, 4, 15). We also give some plots to summarize the results. In Figures 1-2 [...].
In Figure 3, we provide the plot of hat diagonals versus observations. According to Figure 3, the observations having the three highest leverages are the 16th, 2nd and 8th; however, the 2nd and 8th observations are not detected as influential cases. Thus, high-leverage points are not always influential. Figure 4 is the plot of the residuals against the observations. According to this figure, (10, 16, 6, 4, 1) are the observations having the largest residuals. Moreover, the 10th observation has the largest residual; however, it is not the most influential observation. Thus, in a similar manner, having a large residual does not guarantee being the most influential observation. In the last figure, we provide the plots of the distance values of $DFFITS_i^a$ for values of $q$ between 0 and 1 when $k_3$ is used. All distance values of the influential observations increase slowly; however, all distance values remain smaller than 1.
Finally, we consider the following data sets and obtain their influential observations and distance values using the new methods: the Tobacco data (Myers, 1986), the Hald data set (Hald, 1952), the body fat data set (Neter et al., 1997) and the crime rate data set (Agresti and Finlay, 1986). Our detections agree with the literature (Cook, 1977; Ullah et al., 2013). Thus, it is shown that the new diagnostics defined for the two-parameter ridge estimator are successful in determining the influential observations of data sets used in the literature.

Conclusion
In this article, we consider the problems of multicollinearity and influential observations together and propose new diagnostic measures using a two-parameter ridge estimator. In order to obtain approximate versions of the new diagnostic measures, we present approximate case deletion formulas in two-parameter ridge regression using the SMW theorem.
Moreover, we illustrate a real data application using the Longley (1967) data. The numerical results show that the new measures are useful for identifying influential observations. However, we suggest that practitioners use these measures along with their own knowledge and expertise when deciding whether an identified case should be retained, removed or down-weighted.
Jahufer and Jianbao (2009) obtained global influential observations by using a modified ridge regression scheme, Ullah et al. (2013) defined Liu versions, and Ertas et al. (2013) obtained Liu and modified Liu versions of the mentioned single-case diagnostics.
[...] the correlation forms. The minimization of (3.1) is satisfied by the normal system of equations $Z'Z\beta = Z'y$.

[...] $Z'Z$, and the columns of the matrix $V$ are the eigenvectors of $Z'Z$.

[...] ($x_3$), size of armed forces ($x_4$), non-institutional population 14 years of age and over ($x_5$), and time ($x_6$).
[...] the cut-off value, which is computed as 1.0445, and the five observations having the largest Cook's distances. We used four different estimators of the parameter $k$ chosen from the literature and the optimal value of the parameter obtained by using (3.10) to minimize the mean squared error function [...]. The observations are plotted only for the estimator $k_3$ in order to make it easy to observe the influential cases from these figures. According to Figures 1 and 2, the most influential cases are the 16th and 10th cases. These results are consistent with the literature.

Figure 1: Plot of Cook's distance according to two different approaches using $k_3$

Figure 2: Plot of the absolute value of DFFITS using $k_3$

Figure 4: Plot of residuals $e_i(k, q)$ using $k_3$

Figure 7: Plot of $DFFITS_i^a$ versus $q$ using $k_3$

Table 1: The most influential cases according to $DFFITS^a$

Table 2: Influential observations and distance values of some data sets used in the literature

According to Table 2, we see that our detections agree with the literature.