Journal of Data Science


A Platform for Large Scale Statistical Modelling in R
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 208–220
Jason Cairns, Simon Urbanek, Paul Murrell

https://doi.org/10.6339/24-JDS1132
Pub. online: 24 May 2024; Type: Computing in Data Science; Open Access

Received: 30 July 2023
Accepted: 7 April 2024
Published: 24 May 2024

Abstract

As datasets grow in scale, fitting novel statistical models to larger-than-memory data becomes correspondingly challenging. This document outlines the development and use of an API for large-scale modelling, demonstrated through the proof-of-concept platform largescaler, which was built specifically for developing statistical models for big datasets.

Supplementary material

The supplementary material includes a zipped directory of the source packages composing largescaler. The packages can also be accessed on GitHub:
  • orcv
  • chunknet
  • largescaleobjects
  • largescalemodels



Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
big data; distributed computing; modelling

Journal of Data Science: Online ISSN 1683-8602; Print ISSN 1680-743X
