A Review on Optimal Subsampling Methods for Massive Datasets
Volume 19, Issue 1 (2021), pp. 151–172
Pub. online: 28 January 2021
Type: Data Science Reviews
Received
1 September 2020
Accepted
1 October 2020
Published
28 January 2021
Abstract
Subsampling is an effective way to deal with big data problems, and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to optimal subsampling probabilities obtained by minimizing some function of the asymptotic distribution. Optimal subsampling methods have been developed for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how optimal subsampling methods are applied.
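As a rough illustration of what drawing samples according to subsampling probabilities can look like for logistic regression, the minimal Python sketch below computes probabilities proportional to |y_i - p_i(β̃)| ‖x_i‖ from a pilot estimate β̃, draws a subsample with replacement, and attaches inverse-probability weights. This probability form is an assumption taken from the general optimal subsampling literature, not something the abstract specifies. The function names (`subsampling_probabilities`, `draw_subsample`), the toy data, and the zero pilot estimate are all hypothetical, and the sketch is not the R code from the supplementary material.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def subsampling_probabilities(X, y, beta_pilot):
    """Return probabilities proportional to |y_i - p_i(beta_pilot)| * ||x_i||.

    Assumed form for illustration only; beta_pilot is a rough pilot estimate.
    """
    scores = []
    for x_i, y_i in zip(X, y):
        p_i = sigmoid(sum(b * v for b, v in zip(beta_pilot, x_i)))
        norm_i = math.sqrt(sum(v * v for v in x_i))
        scores.append(abs(y_i - p_i) * norm_i)
    total = sum(scores)
    if total == 0.0:
        # Degenerate case: fall back to uniform subsampling.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def draw_subsample(probs, r, seed=0):
    """Draw r indices with replacement and return inverse-probability weights."""
    rng = random.Random(seed)
    n = len(probs)
    idx = rng.choices(range(n), weights=probs, k=r)
    # Weight 1 / (n * pi_i) offsets the unequal sampling probabilities
    # when the subsample is used in a weighted estimating equation.
    weights = [1.0 / (n * probs[i]) for i in idx]
    return idx, weights


# Toy data (hypothetical): first column is an intercept.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
beta_pilot = [0.0, 0.0]  # placeholder pilot estimate

probs = subsampling_probabilities(X, y, beta_pilot)
idx, weights = draw_subsample(probs, r=3)
```

In practice the pilot estimate would typically come from a small uniform subsample, and the selected rows together with their weights would be passed to a weighted logistic regression fit; both steps are omitted here to keep the sketch short.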
Supplementary material
The R functions mentioned in the paper for the optimal subsampling algorithms, along with all datasets, can be found on the Journal of Data Science website.