<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1132</article-id>
<article-id pub-id-type="doi">10.6339/24-JDS1132</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Computing in Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>A Platform for Large Scale Statistical Modelling in R</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Cairns</surname><given-names>Jason</given-names></name><email xlink:href="mailto:jason.cairns@auckland.ac.nz">jason.cairns@auckland.ac.nz</email><xref ref-type="aff" rid="j_jds1132_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Urbanek</surname><given-names>Simon</given-names></name><xref ref-type="aff" rid="j_jds1132_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Murrell</surname><given-names>Paul</given-names></name><xref ref-type="aff" rid="j_jds1132_aff_001">1</xref>
</contrib>
<aff id="j_jds1132_aff_001"><label>1</label>Department of Statistics, <institution>University of Auckland</institution>, Auckland, <country>New Zealand</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:jason.cairns@auckland.ac.nz">jason.cairns@auckland.ac.nz</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2024</year></pub-date><pub-date pub-type="epub"><day>24</day><month>5</month><year>2024</year></pub-date><volume>22</volume><issue>2</issue><fpage>208</fpage><lpage>220</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1132_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The supplementary material includes a zipped directory of the source packages composing <italic>largescaler</italic>. The packages can also be accessed on GitHub through the following hyperlinks: 
<list>
<list-item id="j_jds1132_li_001">
<label>•</label>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/jcai849/orcv">orcv</ext-link></p>
</list-item>
<list-item id="j_jds1132_li_002">
<label>•</label>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/jcai849/chunknet">chunknet</ext-link></p>
</list-item>
<list-item id="j_jds1132_li_003">
<label>•</label>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/jcai849/largescaleobjects">largescaleobjects</ext-link></p>
</list-item>
<list-item id="j_jds1132_li_004">
<label>•</label>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/jcai849/largescalemodels">largescalemodels</ext-link></p>
</list-item>
</list>
</p>
</caption>
</supplementary-material><history><date date-type="received"><day>30</day><month>7</month><year>2023</year></date><date date-type="accepted"><day>7</day><month>4</month><year>2024</year></date></history>
<permissions><copyright-statement>2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>As datasets grow in scale, fitting novel statistical models to larger-than-memory data becomes correspondingly challenging. This document outlines the development and use of an API for large-scale modelling, demonstrated through the proof-of-concept platform <italic>largescaler</italic>, which was built specifically for developing statistical models for big datasets.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>big data</kwd>
<kwd>distributed computing</kwd>
<kwd>modelling</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1132_reflist_001">
<title>References</title>
<ref id="j_jds1132_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Boja</surname> <given-names>C</given-names></string-name>, <string-name><surname>Pocovnicu</surname> <given-names>A</given-names></string-name>, <string-name><surname>Batagan</surname> <given-names>L</given-names></string-name> (<year>2012</year>). <article-title>Distributed parallel architecture for big data</article-title>. <source><italic>Informatică Economică</italic></source>, <volume>16</volume>(<issue>2</issue>): <fpage>116</fpage>. <ext-link ext-link-type="uri" xlink:href="https://www.ams.org/mathscinet-getitem?mr=2965745">MR2965745</ext-link></mixed-citation>
</ref>
<ref id="j_jds1132_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Boyd</surname> <given-names>S</given-names></string-name>, <string-name><surname>Parikh</surname> <given-names>N</given-names></string-name>, <string-name><surname>Chu</surname> <given-names>E</given-names></string-name>, <string-name><surname>Peleato</surname> <given-names>B</given-names></string-name>, <string-name><surname>Eckstein</surname> <given-names>J</given-names></string-name>, <etal>et al.</etal> (<year>2011</year>). <article-title>Distributed optimization and statistical learning via the alternating direction method of multipliers</article-title>. <source><italic>Foundations and Trends</italic>® <italic>in Machine Learning</italic></source>, <volume>3</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>122</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1561/2200000016" xlink:type="simple">https://doi.org/10.1561/2200000016</ext-link></mixed-citation>
</ref>
<ref id="j_jds1132_ref_003">
<mixed-citation publication-type="other"> <string-name><surname>Cairns</surname> <given-names>J</given-names></string-name> (<year>2024</year>). A Platform for Large-Scale Statistical Modelling in R, Ph.D. thesis, University of Auckland.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_004">
<mixed-citation publication-type="other"> <string-name><surname>Eddelbuettel</surname> <given-names>D</given-names></string-name> (<year>2024</year>). CRAN task view: High-performance and parallel computing with R.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_005">
<mixed-citation publication-type="book"> <string-name><surname>Gordon</surname> <given-names>MJC</given-names></string-name> (<year>1984</year>). <source><italic>The Denotational Description of Programming Languages</italic></source>. <edition>1</edition>st edition. <publisher-name>Springer</publisher-name>, <publisher-loc>New York, NY</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Kane</surname> <given-names>MJ</given-names></string-name>, <string-name><surname>Emerson</surname> <given-names>J</given-names></string-name>, <string-name><surname>Weston</surname> <given-names>S</given-names></string-name> (<year>2013</year>). <article-title>Scalable strategies for computing with massive data</article-title>. <source><italic>Journal of Statistical Software</italic></source>, <volume>55</volume>(<issue>14</issue>): <fpage>1</fpage>–<lpage>19</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v055.i14" xlink:type="simple">https://doi.org/10.18637/jss.v055.i14</ext-link></mixed-citation>
</ref>
<ref id="j_jds1132_ref_007">
<mixed-citation publication-type="book"> <string-name><surname>Kleppmann</surname> <given-names>M</given-names></string-name> (<year>2017</year>). <source><italic>Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems</italic></source>. <publisher-name>O’Reilly Media, Inc.</publisher-name></mixed-citation>
</ref>
<ref id="j_jds1132_ref_008">
<mixed-citation publication-type="other"> <string-name><surname>Luraschi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kuo</surname> <given-names>K</given-names></string-name>, <string-name><surname>Ushey</surname> <given-names>K</given-names></string-name>, <string-name><surname>Allaire</surname> <given-names>J</given-names></string-name> (<year>2020</year>). <italic>Sparklyr: R interface to Apache Spark</italic>. R package version 1.1.0.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Mateos</surname> <given-names>G</given-names></string-name>, <string-name><surname>Bazerque</surname> <given-names>JA</given-names></string-name>, <string-name><surname>Giannakis</surname> <given-names>GB</given-names></string-name> (<year>2010</year>). <article-title>Distributed sparse linear regression</article-title>. <source><italic>IEEE Transactions on Signal Processing</italic></source>, <volume>58</volume>(<issue>10</issue>): <fpage>5262</fpage>–<lpage>5276</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TSP.2010.2055862" xlink:type="simple">https://doi.org/10.1109/TSP.2010.2055862</ext-link> <ext-link ext-link-type="uri" xlink:href="https://www.ams.org/mathscinet-getitem?mr=2722673">MR2722673</ext-link></mixed-citation>
</ref>
<ref id="j_jds1132_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Pike</surname> <given-names>R</given-names></string-name> (<year>2012</year>). Concurrency is not parallelism. Heroku.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_011">
<mixed-citation publication-type="book"> <string-name><surname>Quine</surname> <given-names>WV</given-names></string-name> (<year>1979</year>). <source><italic>Mathematical Logic</italic></source>. <publisher-name>Harvard University Press</publisher-name>, <publisher-loc>London, England</publisher-loc>. <ext-link ext-link-type="uri" xlink:href="https://www.ams.org/mathscinet-getitem?mr=0695499">MR0695499</ext-link></mixed-citation>
</ref>
<ref id="j_jds1132_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>Schmidt</surname> <given-names>D</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>WC</given-names></string-name>, <string-name><surname>de la Chapelle</surname> <given-names>SL</given-names></string-name>, <string-name><surname>Ostrouchov</surname> <given-names>G</given-names></string-name>, <string-name><surname>Patel</surname> <given-names>P</given-names></string-name> (<year>2020</year>). pbdBASE: pbdR base wrappers for distributed matrices. R package version 0.5-3.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_013">
<mixed-citation publication-type="chapter"> <string-name><surname>Shvachko</surname> <given-names>K</given-names></string-name>, <string-name><surname>Kuang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Radia</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chansler</surname> <given-names>R</given-names></string-name> (<year>2010</year>). <chapter-title>The Hadoop distributed file system</chapter-title>. In: <source><italic>2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)</italic></source>, <fpage>1</fpage>–<lpage>10</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_014">
<mixed-citation publication-type="other"> <string-name><surname>Weston</surname> <given-names>S</given-names></string-name> (<year>2017</year>). doMPI: Foreach parallel adaptor for the Rmpi package. R package version 0.2.2.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_015">
<mixed-citation publication-type="other"> <string-name><surname>Weston</surname> <given-names>S</given-names></string-name> (<year>2019</year>a). doParallel: Foreach parallel adaptor for the ‘Parallel’ package. R package version 1.0.15.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_016">
<mixed-citation publication-type="other"> <string-name><surname>Weston</surname> <given-names>S</given-names></string-name> (<year>2019</year>b). doSNOW: Foreach parallel adaptor for the ‘SNOW’ package. R package version 1.0.18.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_017">
<mixed-citation publication-type="book"> <string-name><surname>Weston</surname> <given-names>S</given-names></string-name> (<year>2020</year>). <source><italic>Foreach: Provides Foreach Looping Construct</italic></source>. <comment>R package version 1.4.8</comment>.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_018">
<mixed-citation publication-type="chapter"> <string-name><surname>Zaharia</surname> <given-names>M</given-names></string-name>, <string-name><surname>Chowdhury</surname> <given-names>M</given-names></string-name>, <string-name><surname>Das</surname> <given-names>T</given-names></string-name>, <string-name><surname>Dave</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>J</given-names></string-name>, <string-name><surname>McCauley</surname> <given-names>M</given-names></string-name>, <etal>et al.</etal> (<year>2012</year>). <chapter-title>Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing</chapter-title>. In: <source><italic>9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)</italic></source>, <fpage>15</fpage>–<lpage>28</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1132_ref_019">
<mixed-citation publication-type="journal"> <string-name><surname>Zaharia</surname> <given-names>M</given-names></string-name>, <string-name><surname>Xin</surname> <given-names>RS</given-names></string-name>, <string-name><surname>Wendell</surname> <given-names>P</given-names></string-name>, <string-name><surname>Das</surname> <given-names>T</given-names></string-name>, <string-name><surname>Armbrust</surname> <given-names>M</given-names></string-name>, <string-name><surname>Dave</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal> (<year>2016</year>). <article-title>Apache Spark: A unified engine for big data processing</article-title>. <source><italic>Communications of the ACM</italic></source>, <volume>59</volume>(<issue>11</issue>): <fpage>56</fpage>–<lpage>65</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2934664" xlink:type="simple">https://doi.org/10.1145/2934664</ext-link></mixed-citation>
</ref>
<ref id="j_jds1132_ref_020">
<mixed-citation publication-type="other"> <string-name><surname>Zeng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Breheny</surname> <given-names>P</given-names></string-name> (<year>2017</year>). The biglasso package: A memory- and computation-efficient solver for lasso model fitting with big data in R. arXiv preprint: <uri>https://arxiv.org/abs/1701.05936</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
