Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article ID: JDS1052

DOI: 10.6339/22-JDS1052

Section: Computing in Data Science

Variable Selection with Scalable Bootstrapping in Generalized Linear Model for Massive Data

Zhang¹, Zhibing He², Yichen Qin³, Shen⁴, Ben-Chang Shia⁵, Yang Li¹,⁶,∗

¹ Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
² School of Mathematical and Statistical Sciences, Arizona State University, AZ, USA
³ Department of Operations, Business Analytics, and Information Systems, University of Cincinnati, OH, USA
⁴ College of Public Health, University of Georgia, GA, USA
⁵ Graduate Institute of Business Administration and College of Management, Fu Jen Catholic University, Taiwan
⁶ RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China

∗Corresponding author. Email: yang.li@ruc.edu.cn.

2023, Volume 21, Issue 1, pp. 87–105

Published: 7 July 2022

Supplementary Material

The supplementary .zip archive contains the following files and/or directories:

• /code and data/: Directory that includes the code and files necessary to reproduce the numerical results presented in this paper.

• supplementary.pdf: Online supplementary material.

Received 20 February 2022; accepted 26 May 2022.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Bootstrapping is commonly used as a tool for non-parametric statistical inference to assess the quality of estimators in variable selection models. For a massive dataset, however, the computational cost of bootstrapping in variable selection models (BootVS) can be prohibitive. In this study, we propose a novel framework that combines a bag of little bootstraps variable selection (BLBVS) method with a ridge hybrid procedure to assess the quality of estimators in generalized linear models with a regularization term, such as lasso and group lasso penalties. The proposed method can be implemented easily and naturally with distributed computing, and thus has significant computational advantages for massive datasets. Simulation results show that BLBVS performs well in both accuracy and efficiency when compared with BootVS. Real data analyses, including regression on a bike sharing dataset and classification on a lending club dataset, illustrate the computational superiority of BLBVS on large-scale datasets.

Keywords: distributed computing; large-scale dataset; scalable bootstrap; variable selection
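The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions made here for brevity: a Gaussian linear model instead of a general GLM, a plain lasso penalty fitted by coordinate descent, and a "ridge hybrid" step interpreted as refitting the lasso-selected variables with a small ridge penalty. The function names (`weighted_lasso`, `ridge_refit`, `blb_variable_selection`) and the tuning values (`lam`, `gamma`, `s`, `r`, `alpha`) are hypothetical. Each small subset of size b = n^gamma is resampled with multinomial weights that sum to n, so every fit touches only b distinct rows.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, w, lam, n_iter=50):
    """Coordinate descent for (1/(2n)) * sum_i w_i (y_i - x_i'b)^2 + lam * ||b||_1, n = sum(w)."""
    n = w.sum()
    p = X.shape[1]
    beta = np.zeros(p)
    resid = y.astype(float)                      # y - X @ beta with beta = 0
    col_norm = (w[:, None] * X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_norm[j] == 0:
                continue
            resid += X[:, j] * beta[j]           # drop variable j from the fit
            rho = (w * X[:, j] * resid).sum() / n
            beta[j] = soft_threshold(rho, lam) / col_norm[j]
            resid -= X[:, j] * beta[j]           # put the updated variable j back
    return beta

def ridge_refit(X, y, w, support, alpha=1e-3):
    """Assumed hybrid step: weighted ridge refit on the selected variables only."""
    beta = np.zeros(X.shape[1])
    if support.any():
        Xs, n = X[:, support], w.sum()
        A = (Xs * w[:, None]).T @ Xs / n + alpha * np.eye(Xs.shape[1])
        beta[support] = np.linalg.solve(A, (Xs * w[:, None]).T @ y / n)
    return beta

def blb_variable_selection(X, y, lam=0.2, gamma=0.7, s=5, r=30, seed=0):
    """Bag of little bootstraps: s subsets of size n**gamma, r weighted resamples each."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = int(n ** gamma)
    sel_freq, std_err = np.zeros(p), np.zeros(p)
    for _ in range(s):
        idx = rng.choice(n, size=b, replace=False)
        Xb, yb = X[idx], y[idx]
        supports = np.zeros((r, p), dtype=bool)
        betas = np.zeros((r, p))
        for k in range(r):
            # Multinomial counts emulate a size-n resample of the b subset rows.
            w = rng.multinomial(n, np.full(b, 1.0 / b)).astype(float)
            supports[k] = weighted_lasso(Xb, yb, w, lam) != 0
            betas[k] = ridge_refit(Xb, yb, w, supports[k])
        sel_freq += supports.mean(axis=0)        # per-subset selection frequency
        std_err += betas.std(axis=0, ddof=1)     # per-subset bootstrap standard error
    return sel_freq / s, std_err / s             # average over subsets

# Synthetic example: variables 0, 1 and 4 carry signal, the rest are noise.
rng = np.random.default_rng(1)
n, p = 2000, 8
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)
freq, se = blb_variable_selection(X, y)
print(np.round(freq, 2))
print(np.round(se, 3))
```

Because the subsets are processed independently, the outer loop over subsets is the natural unit to distribute across workers; only the per-subset summaries need to be gathered and averaged.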

Dr. Yang Li was supported by the Platform of Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China, and by the National Natural Science Foundation of China (71771211).
