Journal of Data Science


A Review on Optimal Subsampling Methods for Massive Datasets
Volume 19, Issue 1 (2021), pp. 151–172
Yaqiong Yao, HaiYing Wang

https://doi.org/10.6339/21-JDS999
Pub. online: 28 January 2021
Type: Data Science Reviews

Received: 1 September 2020
Accepted: 1 October 2020
Published: 28 January 2021

Abstract

Subsampling is an effective way to deal with big data problems, and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to subsampling probabilities obtained by minimizing some function of the asymptotic distribution of the resulting subsample estimator. Optimal subsampling methods have been developed for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how these methods are applied.
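
To make the two-step strategy described in the abstract concrete, the R sketch below implements one variant for logistic regression: a uniform pilot subsample gives a rough parameter estimate, approximately optimal probabilities proportional to |y_i - p_i| * ||x_i|| (an mVc-type criterion) are computed from it, and a weighted fit on the second-stage subsample corrects the sampling bias. This is a minimal illustration written for this page, not the paper's supplementary code; the function name, the default subsample sizes, and the use of the quasibinomial family (only to avoid R's non-integer-weights warning) are assumptions of the sketch.

two_step_subsample_logit <- function(X, y, r0 = 500, r = 2000) {
  ## X: n x d design matrix, assumed to already contain an intercept column
  ## y: 0/1 response vector; r0, r: pilot and second-stage subsample sizes
  n <- nrow(X)

  ## Step 1: uniform pilot subsample to get a rough parameter estimate
  pilot_idx <- sample(n, r0, replace = TRUE)
  pilot_fit <- glm.fit(X[pilot_idx, , drop = FALSE], y[pilot_idx],
                       family = quasibinomial())
  beta_pilot <- coef(pilot_fit)

  ## Approximately optimal (mVc-type) probabilities:
  ## proportional to |y_i - p_i(beta_pilot)| * ||x_i||, normalized to sum to 1
  p_hat  <- 1 / (1 + exp(-drop(X %*% beta_pilot)))
  score  <- abs(y - p_hat) * sqrt(rowSums(X^2))
  pi_opt <- score / sum(score)

  ## Step 2: subsample with replacement using these probabilities and fit a
  ## weighted logistic regression; weights 1/pi_opt correct the sampling bias
  idx <- sample(n, r, replace = TRUE, prob = pi_opt)
  fit <- glm.fit(X[idx, , drop = FALSE], y[idx],
                 weights = 1 / pi_opt[idx], family = quasibinomial())
  coef(fit)
}

## Toy usage on simulated data (illustrative only)
set.seed(1)
n <- 1e5; d <- 5
X <- cbind(1, matrix(rnorm(n * (d - 1)), n, d - 1))
y <- rbinom(n, 1, 1 / (1 + exp(-drop(X %*% c(-1, rep(0.5, d - 1))))))
two_step_subsample_logit(X, y)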

Supplementary material

The R functions for the optimal subsampling algorithms mentioned in the paper, together with all datasets, can be found on the Journal of Data Science website.



Copyright
© 2021 The Author(s).
This is a free-to-read article.

Keywords
Asymptotic mean squared error; big data
