Journal of Data Science


A Review on Optimal Subsampling Methods for Massive Datasets
Volume 19, Issue 1 (2021), pp. 151–172
Yaqiong Yao, HaiYing Wang

https://doi.org/10.6339/21-JDS999
Pub. online: 28 January 2021
Type: Data Science Reviews

Received: 1 September 2020
Accepted: 1 October 2020
Published: 28 January 2021

Abstract

Subsampling is an effective way to deal with big data problems, and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to subsampling probabilities obtained by minimizing some function of the asymptotic distribution of the resulting subsample estimator. Optimal subsampling methods have been developed for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how these methods are applied.
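
To make the two-step strategy described in the abstract concrete, the R sketch below implements one variant for logistic regression: a uniform pilot subsample gives a rough parameter estimate, approximately optimal probabilities proportional to |y_i - p_i| * ||x_i|| (an mVc-type criterion) are computed from it, and a weighted fit on the second-stage subsample corrects the sampling bias. This is a minimal illustration written for this page, not the paper's supplementary code; the function name, the default subsample sizes, and the use of the quasibinomial family (only to avoid R's non-integer-weights warning) are assumptions of the sketch.

two_step_subsample_logit <- function(X, y, r0 = 500, r = 2000) {
  ## X: n x d design matrix, assumed to already contain an intercept column
  ## y: 0/1 response vector; r0, r: pilot and second-stage subsample sizes
  n <- nrow(X)

  ## Step 1: uniform pilot subsample to get a rough parameter estimate
  pilot_idx <- sample(n, r0, replace = TRUE)
  pilot_fit <- glm.fit(X[pilot_idx, , drop = FALSE], y[pilot_idx],
                       family = quasibinomial())
  beta_pilot <- coef(pilot_fit)

  ## Approximately optimal (mVc-type) probabilities:
  ## proportional to |y_i - p_i(beta_pilot)| * ||x_i||, normalized to sum to 1
  p_hat  <- 1 / (1 + exp(-drop(X %*% beta_pilot)))
  score  <- abs(y - p_hat) * sqrt(rowSums(X^2))
  pi_opt <- score / sum(score)

  ## Step 2: subsample with replacement using these probabilities and fit a
  ## weighted logistic regression; weights 1/pi_opt correct the sampling bias
  idx <- sample(n, r, replace = TRUE, prob = pi_opt)
  fit <- glm.fit(X[idx, , drop = FALSE], y[idx],
                 weights = 1 / pi_opt[idx], family = quasibinomial())
  coef(fit)
}

## Toy usage on simulated data (illustrative only)
set.seed(1)
n <- 1e5; d <- 5
X <- cbind(1, matrix(rnorm(n * (d - 1)), n, d - 1))
y <- rbinom(n, 1, 1 / (1 + exp(-drop(X %*% c(-1, rep(0.5, d - 1))))))
two_step_subsample_logit(X, y)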

Supplementary material

The R functions for the optimal subsampling algorithms mentioned in the paper, together with all datasets, can be found on the Journal of Data Science website.



Copyright
© 2021 The Author(s).
This is a free-to-read article.

Keywords
Asymptotic mean squared error; big data
