FORWARD REGRESSION IN R: FROM THE EXTREME SLOW TO THE EXTREME FAST

Forward regression has been criticised heavily, and among the many reasons are its speed and its stopping criteria. The main focus of this paper is to demonstrate how to make it efficient using R. Our method works for continuous predictor variables only, as the use of the partial correlation plays the most important role.


Introduction
It is a common truth that efficient code is not a matter of the computer, the operating system, or the programming language, but a matter of the implementation. Despite this common argument, not many programmers, scientific developers and, mainly, researchers put much emphasis on this aspect. Researchers looking for speed in their calculations tend to use Matlab, Python, C++, etc. In general, the aforementioned languages or software are faster than R. What most people are not aware of, though, is that R has many "powerful" functions whose efficiency is comparable to that of Matlab and Python (see Ozgur et al. (2017) for a comparison).
Computational efficiency is a main drawback of forward regression as well.
Nevertheless, it is taught in undergraduate departments of statistics across the globe. This paper deals exactly with this issue, right at the heart of the problem: how to make forward regression efficient. To this end, we will make use of R's built-in command cor, a very efficiently written function whose abilities we will take advantage of here.
The paper focuses on the case where both the response and the predictor variables are continuous (i.e. neither categorical nor survival, etc.). In the next section we present the forward selection algorithm using the correlation coefficient. Section 3 shows, in R, the key algorithmic point responsible for reducing the execution time. Examples demonstrate the great computational savings, and Section 4 concludes the paper.

Forward selection method for linear models
Despite the title of this section, the general idea of the forward selection method is applicable to all kinds of models, linear or not, regardless of whether the errors are normally distributed or not. Suppose we have a univariate response variable y ∈ ℝ^n and a matrix of p predictor variables X ∈ ℝ^{n×p}. The algorithm below summarises the method.
Forward selection method

1. Standardise all predictor variables (make them have zero mean and unit variance) and centre the values of the response variable.
2. Examine all univariate associations between y and x_i, for i = 1, ..., p. The most significant among the significant ones, say the j-th, enters the model.
3. Examine all conditional associations (y, x_i) | CS, where CS stands for the conditioning set, i.e. the already selected variables. The most significant among the significant ones enters the model.
4. Repeat Step 3, increasing the CS by 1 every time (one variable is added), until no significant association exists.
In the next two subsections we will see how to perform Steps 2-4, that is, how to assess the significance of the (conditional) associations.

Significance of a univariate association
In order to assess the univariate associations, Pearson's correlation coefficient is calculated. The significance of the sample correlation r is assessed via Fisher's z-transform (Fisher, 1915)

\hat{z} = \frac{1}{2}\log\frac{1 + r}{1 - r}, \qquad \sqrt{n - 3}\,\left(\hat{z} - z\right) \sim N(0, 1) \ \text{approximately},

where z = \frac{1}{2}\log\frac{1 + \rho}{1 - \rho} and ρ is the true correlation coefficient. Alternatively, a t_{n-3} distribution can be used. Note that this is different from assessing the significance of the slope coefficient β of the simple linear regression

y_i = \alpha + \beta x_i + e_i, \quad i = 1, \ldots, n, \qquad (1)

in which case we would use the test statistic

T = \frac{\hat{\beta}}{s.e.\left(\hat{\beta}\right)}.

Under H_0: \beta = 0, T \sim t_{n-2}. Note that in the simple linear regression T^2 coincides with the Wald test statistic for H_0: \beta = 0, which is given by

W = \left[\frac{\hat{\beta}}{s.e.\left(\hat{\beta}\right)}\right]^2, \qquad (2)

and with the usual F-test

F = (n - 2)\,\frac{SSE_0 - SSE_1}{SSE_1},

where SSE_1 is the sum of squares of the errors of the regression model (1) and SSE_0 is simply (n - 1)s_y^2, where s_y^2 is the variance of y. The proof is straightforward and hence omitted. The z or t test based upon Fisher's z-transformation is asymptotically equivalent to the Wald (and hence to the T^2 and the F) tests (Anderson, 2003).
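The asymptotic equivalence is easy to verify numerically. The snippet below uses illustrative simulated data (sample size, seed and coefficient are arbitrary) and is only a sketch of the check, not part of the paper's implementation.

# Fisher's z test for a correlation versus the t-test of the slope in simple regression.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
r <- cor(y, x)
z <- 0.5 * log((1 + r) / (1 - r))                    # Fisher's z-transform
2 * pnorm(abs(z) * sqrt(n - 3), lower.tail = FALSE)  # p-value based on the z-transform
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]     # p-value of the slope t-test

The two p-values are close but not identical for finite samples, as expected from the asymptotic nature of the equivalence.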
The convenience of the z-transformation lies in the fact that its standard error has a known formula and requires no further calculation; this is exploited when the computational cost is examined. The use of the correlation and partial correlation instead of many linear regression models has also been advised by Weisberg (2005) for the purpose of computational savings.

Significance of a conditional association
In the second and later steps of the forward regression, the significance of the conditional correlations is to be examined. The two models are

y = \beta_0 + \sum_{j=1}^{q}\beta_j z_j + e \qquad \text{and} \qquad y = \beta_0 + \sum_{j=1}^{q}\beta_j z_j + \gamma x + e, \qquad (3)

where z = (z_1, \ldots, z_q) denotes the already selected variables (the conditioning set) and x is a candidate variable. Hence, the hypothesis of interest is H_0: \gamma = 0. The partial F-test, or in general the F-test for comparing two models, one nested within the other, is

F = \frac{SSE_0 - SSE_1}{SSE_1 / (n - q - 2)} \sim F_{1,\, n - q - 2} \ \text{under} \ H_0,

where SSE_0 and SSE_1 are the error sums of squares of the smaller and the larger model respectively. Alternatively, the conditional (partial) correlation can be used, with the cardinality of the conditioning set being |z| = q, i.e. there are q conditioning variables. When there is only one conditioning variable, |z| = 1, the conditional correlation has a closed form, which we also take into account in order to decrease the computational cost,

r(y; x|z) = \frac{r(y; x) - r(y; z)\,r(x; z)}{\sqrt{\left[1 - r^2(y; z)\right]\left[1 - r^2(x; z)\right]}},

where r(y; x) denotes the correlation between y & x, and r(y; z), r(x; z) are the correlation coefficients between y & z and between x & z respectively. The general way of calculating the partial correlation, for |z| = q ≥ 1, is given by (Fisher et al., 1924)

r(y; x|z) = \frac{\sum_{i=1}^{n} e_{1i} e_{2i}}{\sqrt{\sum_{i=1}^{n} e_{1i}^2 \sum_{i=1}^{n} e_{2i}^2}} = \text{cor}\left(\mathbf{e}_1, \mathbf{e}_2\right), \qquad (4)

where e_1 and e_2 are the residuals of the two regression models

\mathbf{y} = \mathbf{Z}\mathbf{b}_1 + \mathbf{e}_1 \qquad \text{and} \qquad \mathbf{x} = \mathbf{Z}\mathbf{b}_2 + \mathbf{e}_2. \qquad (5)

The variance of r(y; x|z) is equal to 1/(n - 3 - q).
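The closed form for |z| = 1, the residual-based definition (4)-(5) and the partial F-test can all be checked numerically. The data below are illustrative (seed and coefficients are arbitrary) and the snippet is only a sketch of the check.

# Partial correlation of y and x given z: closed form vs residual-based definition,
# and the partial F-test for comparing the two nested models.
set.seed(2)
n <- 200
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 0.4 * x + 0.3 * z + rnorm(n)
ryx <- cor(y, x); ryz <- cor(y, z); rxz <- cor(x, z)
(ryx - ryz * rxz) / sqrt((1 - ryz^2) * (1 - rxz^2))   # closed form for |z| = 1
e1 <- resid(lm(y ~ z))
e2 <- resid(lm(x ~ z))
cor(e1, e2)                                           # definition (4): the same value
anova(lm(y ~ z), lm(y ~ z + x))                       # the partial F-test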
The γ coefficient of the second model in (3) is related to the partial correlation (4) via the following relationship

r(y; x|z) = \text{sign}\left(\hat{\gamma}\right)\sqrt{\frac{W}{W + n - q - 2}}, \qquad (6)

where W is the Wald test statistic (2) for the γ coefficient.
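Relationship (6) can likewise be verified numerically. The snippet below uses illustrative data with q = 1 conditioning variable; the seed and coefficients are arbitrary.

# The partial correlation equals sign(gamma-hat) * sqrt(W / (W + n - q - 2)).
set.seed(3)
n <- 150; q <- 1
z <- rnorm(n)
x <- 0.6 * z + rnorm(n)
y <- 0.5 * x + 0.2 * z + rnorm(n)
tg <- summary(lm(y ~ z + x))$coefficients["x", "t value"]   # t-statistic of gamma-hat
W <- tg^2                                                   # Wald statistic
sign(tg) * sqrt(W / (W + n - q - 2))
cor(resid(lm(y ~ z)), resid(lm(x ~ z)))                     # same value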
Note that, both for the partial correlation and for the partial F-test, two linear regression models must be fitted. Hence, one could argue that the computational cost is the same. We will see later on, however, that this is not true when using R. Alternatively, relationship (6) can be used, which requires only one fitted model.

Forward selection with linear models in R
The R code for the forward selection method using the partial correlation can be found in the package Rfast (Papadakis et al., 2018). The first two steps of the method are rather straightforward. We remind the reader that all predictor variables have been centred and scaled to have unit variance and that the response variable is only centred. In the first step, the command cor(y, x) is used, where y is a vector and x is a matrix. In the second step, the partial correlation, with one conditioning variable, is applied. The next (if necessary) steps are the most important ones regarding the speed gains. For this reason we take advantage of a very powerful base command in R, .lm.fit, which is 10 times faster than lm.fit.
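The exact factor will of course vary with the machine and the problem size; the snippet below, which assumes the microbenchmark package used later in the paper is installed, merely illustrates how the claim can be checked on simulated data.

# Compare lm, lm.fit and .lm.fit on the same design matrix.
set.seed(4)
n <- 1000
x <- matrix(rnorm(n * 5), ncol = 5)
y <- rnorm(n)
z <- cbind(1, x)                 # design matrix with an intercept column
microbenchmark::microbenchmark(
  lm(y ~ x),                     # formula interface: slowest
  lm.fit(z, y),                  # no formula parsing
  .lm.fit(z, y),                 # barebones interface to the fitting routine: fastest
  times = 100
)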
In the R code below, z holds the variables selected up to step k (the conditioning set), x is the whole set of variables (including those in z) and y is the response variable. The residuals e1 and e2 of the two models in (5) are computed with .lm.fit, and the vector of partial correlations (4) between the response and each of the predictor variables, conditional on the selected variables, is then calculated in the last line.
z <- x[, sela]
e1 <- .lm.fit(z, y)$residuals
e2 <- .lm.fit(z, x)$residuals
yx.z <- cor(e2, e1)

Taking a close look at the above R code segment, you will see that the calculation of the partial correlation coefficient via the residuals (5) is what makes the forward regression fast. According to the forward search algorithm, one must "scan" all non-selected predictor variables at each step. So, with 1000 predictor variables, if 10 variables were to be selected, one would have to fit roughly 10,000 regression models. Using a for loop in R, this quickly becomes slow.
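To put the pieces together, here is a self-contained sketch of the whole procedure built around the code segment above. The function forward.sketch, the stopping rule details and the simulated data are all illustrative; the actual, fully optimised routine is Rfast's cor.fsreg.

# A minimal forward-regression sketch using cor() and .lm.fit();
# an illustration of the idea, not Rfast's cor.fsreg itself.
forward.sketch <- function(y, x, alpha = 0.05) {
  x <- scale(x)                                   # Step 1: standardise the predictors
  y <- y - mean(y)                                # ... and centre the response
  n <- length(y)
  sela <- integer(0)                              # selected variables (the CS)
  cand <- seq_len(ncol(x))                        # remaining candidate variables
  r <- as.vector(cor(y, x))                       # Step 2: all univariate correlations
  repeat {
    q <- length(sela)
    stat <- abs(0.5 * log((1 + r) / (1 - r))) * sqrt(n - 3 - q)  # Fisher's z statistic
    pval <- 2 * pnorm(stat, lower.tail = FALSE)
    best <- which.min(pval)
    if (pval[best] > alpha) break                 # Step 4: stop, nothing is significant
    sela <- c(sela, cand[best])                   # the most significant variable enters
    cand <- cand[-best]
    if (length(cand) == 0) break
    z <- x[, sela, drop = FALSE]                  # conditioning set
    e1 <- .lm.fit(z, y)$residuals
    e2 <- .lm.fit(z, x[, cand, drop = FALSE])$residuals
    r <- as.vector(cor(e2, e1))                   # partial correlations for the next scan
  }
  sela
}

# Example use with simulated data (three truly relevant predictors):
set.seed(42)
x <- matrix(rnorm(300 * 50), ncol = 50)
y <- x[, 1] - 0.7 * x[, 2] + 0.5 * x[, 3] + rnorm(300)
forward.sketch(y, x)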

We will now bridge the mathematics with the R code. The estimates of the coefficients of a linear regression model with design matrix Z are given by

\hat{\mathbf{b}} = \left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{y}. \qquad (7)
Let us translate the two residual objects of the above R code into mathematics using (7):

\mathbf{e}_1 = \mathbf{y} - \mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{y} \qquad \text{and} \qquad \mathbf{e}_2 = \mathbf{X} - \mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{X}.

Hence, instead of having to "scan" the design matrix of predictor variables at every step, fitting one regression model per candidate, all we need to do is two matrix multiplications, one from the left and one from the right. Then, using the cor command, we obtain the vector of partial correlation coefficients in almost no time.
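A quick numerical confirmation of these expressions, with illustrative data, is given below; the dimensions and seed are arbitrary.

# The residuals returned by .lm.fit equal y - Z(Z'Z)^{-1}Z'y and X - Z(Z'Z)^{-1}Z'X.
set.seed(5)
n <- 50
x <- scale(matrix(rnorm(n * 6), ncol = 6))
y <- rnorm(n); y <- y - mean(y)
z <- x[, 1:2]                                  # a small conditioning set
P <- z %*% solve(crossprod(z), t(z))           # projection matrix Z (Z'Z)^{-1} Z'
e1 <- y - P %*% y
e2 <- x - P %*% x
max(abs(e1 - .lm.fit(z, y)$residuals))         # effectively zero
max(abs(e2 - .lm.fit(z, x)$residuals))         # effectively zero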
When a new candidate variable is tried, it is possible that the cross-product of the design matrix is not invertible, because the candidate variable is highly correlated with a previously selected variable, and thus a check should be made at every step. However, the R implementation of the forward regression will work just fine, because the underlying QR decomposition handles these cases.
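If one nevertheless wants an explicit safeguard, a cheap rank check on the augmented design matrix can be carried out before accepting a candidate. The snippet below is a sketch of such a check and is not part of Rfast's implementation.

# Check whether adding a candidate column keeps the design matrix of full rank.
set.seed(6)
n <- 100
z <- cbind(rnorm(n), rnorm(n))        # already selected variables
x_new <- 2 * z[, 2]                   # candidate perfectly collinear with a selected variable
full_rank <- qr(cbind(z, x_new))$rank == ncol(z) + 1
full_rank                             # FALSE: this candidate would be skipped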

Examples comparing execution time
We compared our implementation against the more general implementation of forward regression available in MXM (Lagani et al., 2017). MXM is a variable-selection oriented package, offering many different methods for many types of data. The comparison might not look fair, mainly because MXM's command lm.fsreg for linear models is generic: it is designed to accept any type of predictor variables, continuous and/or categorical, and uses the command lm for this reason. In addition, the algorithm has not been optimised, for example by using .lm.fit or by incrementally updating the design matrix. That implementation follows the forward regression algorithm directly: at each step, all the remaining variables are "scanned" and their significance is examined.
We ran two examples, each with only 500 predictor variables. In the first case, no variable is associated with the response variable. In the second case, the response is linearly related to three predictor variables. The relevant code to produce the data and the output using the package microbenchmark are given below.
x <- matrix(runif(500 * 500, 1, 100), ncol = 500)

We can see that our implementation is 100 times faster than MXM's generic implementation when no predictor variable is truly significant. It is 274 times faster when only three predictor variables are truly significant. In the first case, two variables were selected, whereas in the second case the three significant variables were selected.
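For readers who wish to run a comparison of this kind themselves, a minimal sketch follows. The seed, the regression coefficients and the benchmark settings are illustrative, the arguments of cor.fsreg and lm.fsreg are passed positionally and may differ across package versions, and the speed-up factors quoted above come from the original runs, not from this snippet.

# Scenario 1: no predictor is associated with the response.
# Scenario 2: the response depends linearly on three of the predictors.
set.seed(7)
x <- matrix(runif(500 * 500, 1, 100), ncol = 500)
y1 <- rnorm(500)
y2 <- as.vector(x[, 1:3] %*% c(0.01, 0.02, 0.03)) + rnorm(500)
microbenchmark::microbenchmark(
  Rfast::cor.fsreg(y1, x),    # optimised, correlation-based implementation
  MXM::lm.fsreg(y1, x),       # generic, lm-based implementation
  times = 10
)
# Repeating the call with y2 in place of y1 gives the second scenario.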
In order to get better insight into the time savings, we expanded our simulation studies with larger sample sizes and a higher number of predictor variables, using the same case scenarios. The sample sizes tried were n = 500, 1,000, 5,000, 10,000, 20,000 and 50,000, and the number of variables spanned from 100 to 1,000, with an increasing step of 100. Figure 2 presents the speed-up factors of cor.fsreg relative to lm.fsreg, which indicate how much slower the latter is relative to the former. With small sample sizes, for example (left column of plots), lm.fsreg can be more than 250 times slower than cor.fsreg. This factor decreases with larger sample sizes (right column of plots), but remains above 10 nonetheless. To give an example, where lm.fsreg required on average 3,400 seconds, nearly an hour, cor.fsreg required 280 seconds, less than 5 minutes.

Conclusions
We have shown how to dramatically speed up forward regression in the case of linear models with continuous predictor variables only. A comparison of Rfast's optimised forward regression implementation, for the continuous response and continuous predictor variables case, with the generic (not fully optimised) forward regression implementation in MXM showed dramatic speed-up factors. The relative differences decrease with larger sample sizes, yet they remain substantial.
We demonstrated that speed is often not a matter of the computer, the computational power, or even the algorithm, but a matter of the implementation. This statement is heard very often, yet many functions and packages are still written without taking it into account.
The second message here is that, in order to have really fast functions, one must distinguish the possible cases and optimise each case separately, rather than rely on a single generic routine; a generic function whose special cases are individually optimised is, in the end, the optimal solution.