Journal of Data Science

Submit your article Information
  • Article info
  • Related articles
  • More
    Article info Related articles

Sampling-based Gaussian Mixture Regression for Big Data
Volume 21, Issue 1 (2023), pp. 158–172
JooChul Lee, Elizabeth D. Schifano, HaiYing Wang
https://doi.org/10.6339/22-JDS1057
Pub. online: 9 August 2022      Type: Statistical Data Science      Open Access

Received: 29 May 2022
Accepted: 2 July 2022
Published: 9 August 2022

Abstract

This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of large-scale data. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points so as to minimize the asymptotic mean squared errors of the general estimator and of linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable two-step algorithm is developed: we first approximate the optimal subsampling probabilities using a pilot sample, then select a subsample with the approximated probabilities and compute estimates from that subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.
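The two-step scheme described above (uniform pilot sample → approximated subsampling probabilities → weighted fit on the main subsample) can be sketched in a few lines. The paper's exact optimal probabilities for mixture regression are derived in the supplementary material; the sketch below instead uses plain linear regression with a heuristic probability rule (probabilities proportional to |pilot residual| times the covariate norm, in the spirit of A-optimality-style subsampling), so the function name and probability rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def two_step_subsample_fit(X, y, r0=200, r=1000, rng=None):
    """Illustrative two-step nonuniform subsampling for linear regression.

    Step 1: fit on a small uniform pilot subsample of size r0.
    Step 2: assign subsampling probabilities proportional to
    |pilot residual| * ||x_i|| (a heuristic stand-in for the optimal
    probabilities), draw the main subsample of size r, and solve
    inverse-probability-weighted least squares.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Step 1: uniform pilot subsample and pilot estimate.
    pilot = rng.choice(n, size=r0, replace=False)
    beta0, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)

    # Step 2: approximate the subsampling probabilities from the pilot fit.
    resid = y - X @ beta0
    score = np.abs(resid) * np.linalg.norm(X, axis=1)
    probs = score / score.sum()

    # Draw the main subsample with the approximated probabilities and
    # solve weighted least squares with weights 1 / (r * pi_i), which
    # corrects the bias introduced by nonuniform sampling.
    idx = rng.choice(n, size=r, replace=True, p=probs)
    w = 1.0 / (r * probs[idx])
    Xw = X[idx] * np.sqrt(w)[:, None]
    yw = y[idx] * np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta
```

The inverse-probability weighting in the final fit is what makes the estimator consistent despite informative sampling; the same correction underlies the general estimator whose asymptotic normality the paper establishes.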

Supplementary material

  • Software: R code for the proposed methods and algorithms is available on GitHub: https://github.com/pedigree07/OPTMixture.
  • Supplementary document: The supplementary document provides the proofs of the theorems.



Copyright
2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
EM algorithm, massive data, optimal probabilities, subsampling

Funding
HaiYing Wang’s research was partially supported by the US NSF grant CCF-2105571.

Journal of Data Science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
