Variable Selection with Scalable Bootstrapping in Generalized Linear Model for Massive Data
Volume 21, Issue 1 (2023), pp. 87–105
Pub. online: 7 July 2022
Type: Computing In Data Science
Open Access
Received: 20 February 2022
Accepted: 26 May 2022
Published: 7 July 2022
Abstract
Bootstrapping is commonly used as a tool for non-parametric statistical inference to assess the quality of estimators in variable selection models. However, for massive datasets, the computational cost of bootstrapping in variable selection models (BootVS) can be prohibitive. In this study, we propose a novel framework using a bag of little bootstraps variable selection (BLBVS) method with a ridge hybrid procedure to assess the quality of estimators in generalized linear models with a regularization term, such as the lasso and group lasso penalties. The proposed method can be easily and naturally implemented with distributed computing, and thus has significant computational advantages for massive datasets. The simulation results show that our novel BLBVS method performs excellently in both accuracy and efficiency when compared with BootVS. Real data analyses, including regression on a bike sharing dataset and classification of a lending club dataset, are presented to illustrate the computational superiority of BLBVS on large-scale datasets.
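To illustrate the idea behind the bag of little bootstraps applied to variable selection, the following is a minimal NumPy sketch, not the authors' implementation. It uses a plain lasso fitted by coordinate descent as the selection model (no ridge hybrid step), draws s disjoint subsets of size b = n^gamma, and within each subset emulates full size-n resamples with multinomial weights; the function names, the penalty level, and the subset parameters are all illustrative choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in coordinate-descent lasso."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam_abs, n_iter=100):
    """Minimize 0.5*||y - X b||^2 + lam_abs*||b||_1 by coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam_abs) / col_sq[j]
    return beta

def blb_selection_freq(X, y, lam=0.2, s=4, gamma=0.7, r=50, seed=0):
    """BLB estimate of per-variable lasso selection frequencies (sketch).

    The s subsets are independent, so in a distributed setting the outer
    loop would be dispatched to separate workers.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = int(n ** gamma)               # little-bootstrap subset size b = n^gamma
    freq = np.zeros(p)
    for _ in range(s):
        idx = rng.choice(n, size=b, replace=False)
        Xs, ys = X[idx], y[idx]
        sel = np.zeros(p)
        for _ in range(r):
            # multinomial weights emulate a full size-n resample on b points
            w = rng.multinomial(n, np.ones(b) / b)
            sw = np.sqrt(w)
            # row-scaling by sqrt(w) turns the weighted loss into OLS form
            beta = lasso_cd(Xs * sw[:, None], ys * sw, lam * n)
            sel += np.abs(beta) > 1e-8
        freq += sel / r
    return freq / s                   # average selection frequency per variable

# toy demonstration: one strong predictor among five
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 5))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(300)
freq = blb_selection_freq(X, y)
```

The multinomial-weight trick is what makes BLB cheap: each inner fit touches only b rows, yet the weights sum to n, so the penalized estimator behaves as if fitted to a full-size resample.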
Supplementary material
Supplementary Material.zip contains the following files and/or directories:
• /code and data/: Directory containing the code and data files needed to reproduce the numerical results presented in this paper.
• supplementary.pdf: Online supplementary material.