<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn>
<issn pub-type="ppub">1680-743X</issn>
<issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS999</article-id>
<article-id pub-id-type="doi">10.6339/21-JDS999</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science Review</subject></subj-group></article-categories>
<title-group>
<article-title>A Review on Optimal Subsampling Methods for Massive Datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yao</surname><given-names>Yaqiong</given-names></name><xref ref-type="aff" rid="j_jds999_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>HaiYing</given-names></name><email xlink:href="mailto:haiying.wang@uconn.edu">haiying.wang@uconn.edu</email><xref ref-type="aff" rid="j_jds999_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds999_aff_001"><label>1</label>Department of Statistics, <institution>University of Connecticut</institution>, Storrs, CT, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:haiying.wang@uconn.edu">haiying.wang@uconn.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2021</year></pub-date><pub-date pub-type="epub"><day>28</day><month>1</month><year>2021</year></pub-date>
<volume>19</volume><issue>1</issue><fpage>151</fpage><lpage>172</lpage>
<supplementary-material id="S1" content-type="archive" xlink:href="jds999_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The <sans-serif>R</sans-serif> functions for the optimal subsampling algorithms described in the paper, together with all datasets, can be found on the <italic>Journal of Data Science</italic> website.</p>
</caption>
</supplementary-material>
<history>
<date date-type="received"><month>9</month><year>2020</year></date>
<date date-type="accepted"><month>10</month><year>2020</year></date>
</history>
<permissions><copyright-statement>2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2021</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Subsampling is an effective way to deal with big data problems, and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to optimal subsampling probabilities obtained by minimizing some function, such as the asymptotic mean squared error, of the asymptotic distribution of the resulting estimator. Optimal subsampling methods have been developed for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how optimal subsampling methods are applied.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>Asymptotic mean squared error</kwd>
<kwd>big data</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds999_reflist_001">
<title>References</title>
<ref id="j_jds999_ref_001">
<mixed-citation publication-type="other"> <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (2019). Optimal subsampling algorithms for big data regressions. Statistica Sinica. Forthcoming, <uri>https://doi.org/10.5705/ss.202018.0439</uri>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Cheng</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>M</given-names></string-name> (<year>2020</year>). <article-title>Information-based optimal subdata selection for big data logistic regression</article-title>. <source>Journal of Statistical Planning and Inference</source>, <volume>209</volume>: <fpage>112</fpage>–<lpage>122</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Derezinski</surname> <given-names>M</given-names></string-name>, <string-name><surname>Warmuth</surname> <given-names>MKK</given-names></string-name>, <string-name><surname>Hsu</surname> <given-names>DJ</given-names></string-name> (<year>2018</year>). <chapter-title>Leveraged volume sampling for linear regression</chapter-title>. In: <source>Advances in Neural Information Processing Systems</source> (<string-name><given-names>S</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>H</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>H</given-names> <surname>Larochelle</surname></string-name>, <string-name><given-names>K</given-names> <surname>Grauman</surname></string-name>, <string-name><given-names>N</given-names> <surname>Cesa-Bianchi</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), volume <volume>31</volume>, <fpage>2505</fpage>–<lpage>2514</lpage>. <publisher-name>Curran Associates, Inc.</publisher-name></mixed-citation>
</ref>
<ref id="j_jds999_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Drineas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>M</given-names></string-name>, <string-name><surname>Muthukrishnan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sarlos</surname> <given-names>T</given-names></string-name> (<year>2011</year>). <article-title>Faster least squares approximation</article-title>. <source>Numerische Mathematik</source>, <volume>117</volume>: <fpage>219</fpage>–<lpage>249</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_005">
<mixed-citation publication-type="chapter"> <string-name><surname>Drineas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Muthukrishnan</surname> <given-names>S</given-names></string-name> (<year>2006</year>). <chapter-title>Sampling algorithms for <italic>l</italic><sub>2</sub> regression and applications</chapter-title>. In: <source>Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, SODA ’06</source>, <fpage>1127</fpage>–<lpage>1136</lpage>. <publisher-name>Society for Industrial and Applied Mathematics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_006">
<mixed-citation publication-type="other"> <string-name><surname>Dua</surname> <given-names>D</given-names></string-name>, <string-name><surname>Graff</surname> <given-names>C</given-names></string-name> (2017). UCI machine learning repository.</mixed-citation>
</ref>
<ref id="j_jds999_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Fanaee-T</surname> <given-names>H</given-names></string-name>, <string-name><surname>Gama</surname> <given-names>J</given-names></string-name> (<year>2014</year>). <article-title>Event labeling combining ensemble detectors and background knowledge</article-title>. <source>Progress in Artificial Intelligence</source>, <volume>2</volume>: <fpage>113</fpage>–<lpage>127</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Fithian</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hastie</surname> <given-names>T</given-names></string-name> (<year>2014</year>). <article-title>Local case-control sampling: Efficient subsampling in imbalanced data sets</article-title>. <source>Annals of statistics</source>, <volume>42</volume>(<issue>5</issue>): <fpage>1693</fpage>–<lpage>1724</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Han</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name> (<year>2020</year>). <article-title>Local uncertainty sampling for large-scale multiclass logistic regression</article-title>. <source>Annals of Statistics</source>, <volume>48</volume>(<issue>3</issue>): <fpage>1770</fpage>–<lpage>1788</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Koenker</surname> <given-names>R</given-names></string-name> (2020). quantreg: Quantile Regression. R package version 5.55.</mixed-citation>
</ref>
<ref id="j_jds999_ref_011">
<mixed-citation publication-type="journal"> <string-name><surname>Lin</surname> <given-names>N</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>R</given-names></string-name> (<year>2011</year>). <article-title>Aggregated estimating equation estimation</article-title>. <source>Statistics and Its Interface</source>, <volume>4</volume>: <fpage>73</fpage>–<lpage>83</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>Lumley</surname> <given-names>T</given-names></string-name> (2020). survey: Analysis of Complex Survey Samples. R package version 4.0.</mixed-citation>
</ref>
<ref id="j_jds999_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>B</given-names></string-name> (<year>2015</year>). <article-title>A statistical perspective on algorithmic leveraging</article-title>. <source>Journal of Machine Learning Research</source>, <volume>16</volume>(<issue>1</issue>): <fpage>861</fpage>–<lpage>911</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_014">
<mixed-citation publication-type="journal"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>X</given-names></string-name> (<year>2015</year>). <article-title>Leveraging for big data regression</article-title>. <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>, <volume>7</volume>(<issue>1</issue>): <fpage>70</fpage>–<lpage>76</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_015">
<mixed-citation publication-type="chapter"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>J</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>M</given-names></string-name> (<year>2020</year>). <chapter-title>Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms</chapter-title>. In: <source>Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics</source> (<string-name><given-names>S</given-names> <surname>Chiappa</surname></string-name>, <string-name><given-names>R</given-names> <surname>Calandra</surname></string-name>, eds.), volume <volume>108</volume> of <series><italic>Proceedings of Machine Learning Research</italic></series>, <fpage>1026</fpage>–<lpage>1035</lpage>. <publisher-name>PMLR, Online</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name> (<year>2011</year>). <article-title>Randomized algorithms for matrices and data</article-title>. <source><italic>Foundations and Trends</italic>® <italic>in Machine Learning</italic></source>, <volume>3</volume>(<issue>2</issue>): <fpage>123</fpage>–<lpage>224</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Portnoy</surname> <given-names>S</given-names></string-name>, <string-name><surname>Koenker</surname> <given-names>R</given-names></string-name>, <etal>et al.</etal> (<year>1997</year>). <article-title>The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators</article-title>. <source>Statistical Science</source>, <volume>12</volume>(<issue>4</issue>): <fpage>279</fpage>–<lpage>300</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Pronzato</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2021</year>). <article-title>Sequential online subsampling for thinning experimental designs</article-title>. <source>Journal of Statistical Planning and Inference</source>, <volume>212</volume>: <fpage>169</fpage>–<lpage>193</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_019">
<mixed-citation publication-type="book"> <collab>R Core Team</collab> (<year>2020</year>). <source>R: A Language and Environment for Statistical Computing</source>. <publisher-name>R Foundation for Statistical Computing</publisher-name>, <publisher-loc>Vienna, Austria</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Schifano</surname> <given-names>ED</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>MH</given-names></string-name> (<year>2016</year>). <article-title>Online updating of statistical inference in the big data setting</article-title>. <source>Technometrics</source>, <volume>58</volume>(<issue>3</issue>): <fpage>393</fpage>–<lpage>403</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_021">
<mixed-citation publication-type="journal"> <string-name><surname>Toulis</surname> <given-names>P</given-names></string-name>, <string-name><surname>Airoldi</surname> <given-names>EM</given-names></string-name>, <etal>et al.</etal> (<year>2017</year>). <article-title>Asymptotic and finite-sample properties of estimators based on stochastic gradients</article-title>. <source>Annals of Statistics</source>, <volume>45</volume>(<issue>4</issue>): <fpage>1694</fpage>–<lpage>1727</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_022">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2019</year>a). <article-title>Divide-and-conquer information-based optimal subdata selection algorithm</article-title>. <source>Journal of Statistical Theory and Practice</source>, <volume>13</volume>(<issue>3</issue>): <fpage>1</fpage>–<lpage>19</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2019</year>b). <article-title>More efficient estimation for logistic regression with optimal subsamples</article-title>. <source>Journal of Machine Learning Research</source>, <volume>20</volume>(<issue>132</issue>): <fpage>1</fpage>–<lpage>59</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_024">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>Y</given-names></string-name> (<year>2020</year>). <article-title>Optimal subsampling for quantile regression in big data</article-title>. <source>Biometrika</source>, <comment>in press. Forthcoming</comment>, <uri>https://doi.org/10.1093/biomet/asaa043</uri>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_025">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Stufken</surname> <given-names>J</given-names></string-name> (<year>2019</year>). <article-title>Information-based optimal subdata selection for big data linear regression</article-title>. <source>Journal of the American Statistical Association</source>, <volume>114</volume>(<issue>525</issue>): <fpage>393</fpage>–<lpage>405</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_026">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>P</given-names></string-name> (<year>2018</year>). <article-title>Optimal subsampling for large sample logistic regression</article-title>. <source>Journal of the American Statistical Association</source>, <volume>113</volume>(<issue>522</issue>): <fpage>829</fpage>–<lpage>844</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_027">
<mixed-citation publication-type="journal"> <string-name><surname>Yao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2018</year>). <article-title>Optimal subsampling for softmax regression</article-title>. <source>Statistical Papers</source>, <volume>60</volume>: <fpage>585</fpage>–<lpage>599</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_028">
<mixed-citation publication-type="journal"> <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name> (<year>2020</year>). <article-title>Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data</article-title>. <source>Journal of the American Statistical Association.</source> <comment>Forthcoming</comment>, <uri>https://doi.org/10.1080/01621459.2020.1773832</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
