A Platform for Large Scale Statistical Modelling in R

Cairns, Jason; Urbanek, Simon; Murrell, Paul

doi:10.6339/24-JDS1132

Journal of Data Science

Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 208–220

https://doi.org/10.6339/24-JDS1132

Pub. online: 24 May 2024
Type: Computing In Data Science

Open Access

Received: 30 July 2023
Accepted: 7 April 2024
Published: 24 May 2024

Abstract

As big datasets continue to grow, fitting novel statistical models to larger-than-memory data becomes correspondingly challenging. This document outlines the development and use of an API for large-scale modelling, demonstrated through the proof-of-concept platform largescaler, which was built specifically for developing statistical models on big datasets.

Supplementary material

The supplementary material includes a zipped directory of the source packages composing largescaler. The packages can also be accessed on GitHub through the following hyperlinks:

• orcv
• chunknet
• largescaleobjects
• largescalemodels

© 2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

big data; distributed computing; modelling
