A Platform for Large Scale Statistical Modelling in R
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 208–220
Pub. online: 24 May 2024
Type: Computing In Data Science
Open Access
Received: 30 July 2023
Accepted: 7 April 2024
Published: 24 May 2024
Abstract
As datasets grow in scale, fitting novel statistical models to larger-than-memory data becomes correspondingly challenging. This document outlines the development and use of an API for large-scale modelling, demonstrated through largescaler, a proof-of-concept platform built specifically for developing statistical models on big datasets.
Supplementary material
The supplementary material includes a zipped directory of the source packages composing largescaler. The packages are also available on GitHub:
• orcv
• chunknet
• largescaleobjects
• largescalemodels