Variable Selection with Scalable Bootstrapping in Generalized Linear Model for Massive Data

Zhang, Zhang; He, Zhibing; Qin, Yichen; Shen, Ye; Shia, Ben-Chang; Li, Yang

doi:10.6339/22-JDS1052

Journal of Data Science

Volume 21, Issue 1 (2023), pp. 87–105

Pub. online: 7 July 2022 Type: Computing In Data Science

Open Access

Received: 20 February 2022
Accepted: 26 May 2022
Published: 7 July 2022

Abstract

Bootstrapping is commonly used as a tool for non-parametric statistical inference to assess the quality of estimators in variable selection models. For a massive dataset, however, the computational cost of bootstrapping in variable selection models (BootVS) can be prohibitive. In this study, we propose a novel framework that combines a bag of little bootstraps variable selection (BLBVS) method with a ridge hybrid procedure to assess the quality of estimators in generalized linear models with a regularization term, such as the lasso or group lasso penalty. The proposed method can be implemented easily and naturally with distributed computing, and thus has significant computational advantages for massive datasets. Simulation results show that BLBVS achieves excellent accuracy and efficiency compared with BootVS. Real data analyses, including regression on a bike sharing dataset and classification on a lending club dataset, illustrate the computational superiority of BLBVS on large-scale datasets.
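To make the resampling idea concrete, the following is a minimal sketch of a generic bag of little bootstraps loop wrapped around a lasso fit. It is not the authors' BLBVS implementation: it assumes a Gaussian linear model rather than a general GLM, omits the ridge hybrid step, and runs serially. The function names (`weighted_lasso`, `blb_selection_frequency`, `make_data`), the subset-size exponent `gamma`, and all default values are illustrative assumptions.

```python
import random
from collections import Counter


def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, w, lam, n_iter=50):
    """Coordinate descent for (1/(2W)) * sum_i w_i (y_i - x_i'b)^2 + lam * ||b||_1.

    No intercept is fitted; the data are assumed to be centered.
    """
    p = len(X[0])
    W = float(sum(w))
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta, with beta starting at zero
    for _ in range(n_iter):
        for j in range(p):
            # Weighted covariance of column j with the partial residual.
            rho = sum(wi * xi[j] * (ri + xi[j] * beta[j])
                      for wi, xi, ri in zip(w, X, resid)) / W
            denom = sum(wi * xi[j] ** 2 for wi, xi in zip(w, X)) / W
            new = soft_threshold(rho, lam) / denom if denom > 0 else 0.0
            delta = new - beta[j]
            if delta != 0.0:
                for i, xi in enumerate(X):
                    resid[i] -= xi[j] * delta
                beta[j] = new
    return beta


def blb_selection_frequency(X, y, lam, gamma=0.7, s=4, r=20, seed=0):
    """Selection frequency of each variable over s subsets and r resamples each."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    b = int(n ** gamma)  # size of each little subset (assumed rule b = n^gamma)
    assert s * b <= n, "subsets must be disjoint"
    idx = list(range(n))
    rng.shuffle(idx)
    hits = [0] * p
    for k in range(s):
        sub = idx[k * b:(k + 1) * b]
        Xs = [X[i] for i in sub]
        ys = [y[i] for i in sub]
        for _ in range(r):
            # Multinomial(n, 1/b) counts: a size-n resample stored as b weights.
            counts = Counter(rng.choices(range(b), k=n))
            w = [counts.get(i, 0) for i in range(b)]
            beta = weighted_lasso(Xs, ys, w, lam)
            for j in range(p):
                if beta[j] != 0.0:
                    hits[j] += 1
    return [h / (s * r) for h in hits]


def make_data(n=200, seed=1):
    """Synthetic data: only the first two of five predictors matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [2.0 * x[0] - 1.5 * x[1] + rng.gauss(0, 0.5) for x in X]
    return X, y


X, y = make_data()
freq = blb_selection_frequency(X, y, lam=0.3)
print([round(f, 2) for f in freq])
```

Each resample touches only `b` distinct observations, even though its weights sum to `n`. The iterations of the outer loop over subsets are independent of one another, so in principle they could be dispatched to separate workers.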

Supplementary material

The supplementary .zip archive contains the following files and directories:
• /code and data/: directory that includes the code and files necessary to reproduce the numerical results presented in this paper.
• supplementary.pdf: online supplementary material.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

distributed computing; large-scale dataset; scalable bootstrap; variable selection

Funding

Dr. Yang Li was supported by the Platform of Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China, and by the National Natural Science Foundation of China (71771211).