Journal of Data Science

Variable Selection with Scalable Bootstrapping in Generalized Linear Model for Massive Data
Volume 21, Issue 1 (2023), pp. 87–105
Zhang Zhang, Zhibing He, Yichen Qin, et al. (6 authors)

https://doi.org/10.6339/22-JDS1052
Pub. online: 7 July 2022
Type: Computing In Data Science
Open Access

Received: 20 February 2022
Accepted: 26 May 2022
Published: 7 July 2022

Abstract

Bootstrapping is commonly used as a tool for non-parametric statistical inference to assess the quality of estimators in variable selection models. However, for a massive dataset, the computational cost of bootstrapping in variable selection models (BootVS) can be prohibitive. In this study, we propose a novel framework using a bag of little bootstraps variable selection (BLBVS) method with a ridge hybrid procedure to assess the quality of estimators in generalized linear models with a regularization term, such as lasso and group lasso penalties. The proposed method can be implemented easily and naturally with distributed computing, and thus has significant computational advantages for massive datasets. The simulation results show that our novel BLBVS method performs excellently in both accuracy and efficiency when compared with BootVS. Real data analyses, including regression on a bike sharing dataset and classification on a Lending Club dataset, are presented to illustrate the computational superiority of BLBVS on large-scale datasets.
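
To make the idea concrete, the following is a minimal sketch of a bag of little bootstraps applied to lasso variable selection. It is not the authors' BLBVS implementation: the ridge hybrid procedure is omitted, the model is a plain Gaussian linear model (one special case of a GLM), and every name and setting here (`weighted_lasso`, `blb_selection_frequency`, `simulate`, `lam`, `gamma`, `s`, `r`) is an assumption introduced for illustration. The sketch uses only the Python standard library so that it runs without extra packages.

```python
import random
from collections import Counter


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, w, lam, n_iter=50):
    """Coordinate descent for (1/(2W)) * sum_i w_i (y_i - x_i'beta)^2 + lam * ||beta||_1,
    where W = sum_i w_i. No intercept; a fixed number of sweeps is used for simplicity."""
    p = len(X[0])
    total = float(sum(w))
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta; beta starts at zero
    scale = [sum(wi * row[j] ** 2 for wi, row in zip(w, X)) / total for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if scale[j] == 0.0:
                continue
            # Weighted covariance of feature j with the partial residual (feature j added back).
            rho = sum(wi * row[j] * (ri + row[j] * beta[j])
                      for wi, row, ri in zip(w, X, resid)) / total
            new_bj = soft_threshold(rho, lam) / scale[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [ri - row[j] * delta for ri, row in zip(resid, X)]
                beta[j] = new_bj
    return beta


def blb_selection_frequency(X, y, lam, gamma=0.7, s=3, r=20, seed=0):
    """Average, over s little subsets, of the fraction of r resamples selecting each variable."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    b = int(n ** gamma)  # subset size b = n^gamma, much smaller than n
    indices = list(range(n))
    rng.shuffle(indices)
    per_subset = []
    for k in range(s):
        subset = indices[k * b:(k + 1) * b]  # disjoint subsets; assumes s * b <= n
        Xs = [X[i] for i in subset]
        ys = [y[i] for i in subset]
        selected = [0] * p
        for _ in range(r):
            # A size-n resample of the b points, stored as multinomial counts (weights).
            draws = Counter(rng.choices(range(b), k=n))
            w = [draws[i] for i in range(b)]
            beta = weighted_lasso(Xs, ys, w, lam)
            for j in range(p):
                if beta[j] != 0.0:
                    selected[j] += 1
        per_subset.append([c / r for c in selected])
    return [sum(f[j] for f in per_subset) / s for j in range(p)]


def simulate(n=2000, seed=1):
    """Toy data (hypothetical): two signal variables and three noise variables."""
    rng = random.Random(seed)
    true_beta = [2.0, 0.0, -1.5, 0.0, 0.0]
    X = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(n)]
    y = [sum(bj * xj for bj, xj in zip(true_beta, row)) + rng.gauss(0, 1) for row in X]
    return X, y


X, y = simulate()
freq = blb_selection_frequency(X, y, lam=0.2)
print([round(f, 2) for f in freq])
```

Each resample touches only `b` distinct observations, yet its weights sum to `n`, so every fit mimics a full-size bootstrap replicate at a fraction of the cost. The loop over subsets has no shared state, which is why a scheme of this kind maps naturally onto distributed workers.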

Supplementary material

The supplementary .zip archive contains the following files and directories:
  • /code and data/: Directory that includes code and files necessary to reproduce the numerical results presented in this paper.
  • supplementary.pdf: Online supplementary material.



Copyright
2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
distributed computing; large-scale dataset; scalable bootstrap; variable selection

Funding
Dr. Yang Li was supported by the Platform of Public Health & Disease Control and Prevention, the Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China, and the National Natural Science Foundation of China (71771211).

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
