<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn>
<issn pub-type="ppub">1680-743X</issn>
<issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS999</article-id>
<article-id pub-id-type="doi">10.6339/21-JDS999</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Data Science Review</subject></subj-group></article-categories>
<title-group>
<article-title>A Review on Optimal Subsampling Methods for Massive Datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yao</surname><given-names>Yaqiong</given-names></name><xref ref-type="aff" rid="j_jds999_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>HaiYing</given-names></name><email xlink:href="mailto:haiying.wang@uconn.edu">haiying.wang@uconn.edu</email><xref ref-type="aff" rid="j_jds999_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds999_aff_001"><label>1</label>Department of Statistics, <institution>University of Connecticut</institution>, Storrs, CT, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:haiying.wang@uconn.edu">haiying.wang@uconn.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2021</year></pub-date><pub-date pub-type="epub"><day>28</day><month>1</month><year>2021</year></pub-date>
<volume>19</volume><issue>1</issue><fpage>151</fpage><lpage>172</lpage>
<supplementary-material id="S1" content-type="archive" xlink:href="jds999_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The <sans-serif>R</sans-serif> functions for the optimal subsampling algorithms described in the paper, together with all datasets, can be found on the <italic>Journal of Data Science</italic> website.</p>
</caption>
</supplementary-material>
<history>
<date date-type="received"><month>9</month><year>2020</year></date>
<date date-type="accepted"><month>10</month><year>2020</year></date>
</history>
<permissions><copyright-statement>2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2021</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Subsampling is an effective way to deal with big data problems, and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to optimal subsampling probabilities obtained by minimizing some function, such as the asymptotic mean squared error, of the asymptotic distribution of the resulting estimator. Optimal subsampling methods have been developed for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how optimal subsampling methods are applied.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>Asymptotic mean squared error</kwd>
<kwd>big data</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds999_reflist_001">
<title>References</title>
<ref id="j_jds999_ref_001">
<mixed-citation publication-type="other"> <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (2019). Optimal subsampling algorithms for big data regressions. Statistica Sinica. Forthcoming, <uri>https://doi.org/10.5705/ss.202018.0439</uri>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Cheng</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>M</given-names></string-name> (<year>2020</year>). <article-title>Information-based optimal subdata selection for big data logistic regression</article-title>. <source>Journal of Statistical Planning and Inference</source>, <volume>209</volume>: <fpage>112</fpage>–<lpage>122</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Derezinski</surname> <given-names>M</given-names></string-name>, <string-name><surname>Warmuth</surname> <given-names>MKK</given-names></string-name>, <string-name><surname>Hsu</surname> <given-names>DJ</given-names></string-name> (<year>2018</year>). <chapter-title>Leveraged volume sampling for linear regression</chapter-title>. In: <source>Advances in Neural Information Processing Systems</source> (<string-name><given-names>S</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>H</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>H</given-names> <surname>Larochelle</surname></string-name>, <string-name><given-names>K</given-names> <surname>Grauman</surname></string-name>, <string-name><given-names>N</given-names> <surname>Cesa-Bianchi</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), volume <volume>31</volume>, <fpage>2505</fpage>–<lpage>2514</lpage>. <publisher-name>Curran Associates, Inc.</publisher-name></mixed-citation>
</ref>
<ref id="j_jds999_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Drineas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>M</given-names></string-name>, <string-name><surname>Muthukrishnan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sarlos</surname> <given-names>T</given-names></string-name> (<year>2011</year>). <article-title>Faster least squares approximation</article-title>. <source>Numerische Mathematik</source>, <volume>117</volume>: <fpage>219</fpage>–<lpage>249</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_005">
<mixed-citation publication-type="chapter"> <string-name><surname>Drineas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Muthukrishnan</surname> <given-names>S</given-names></string-name> (<year>2006</year>). <chapter-title>Sampling algorithms for <italic>l</italic><sub>2</sub> regression and applications</chapter-title>. In: <source>Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, SODA ’06</source>, <fpage>1127</fpage>–<lpage>1136</lpage>. <publisher-name>Society for Industrial and Applied Mathematics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_006">
<mixed-citation publication-type="other"> <string-name><surname>Dua</surname> <given-names>D</given-names></string-name>, <string-name><surname>Graff</surname> <given-names>C</given-names></string-name> (2017). UCI machine learning repository.</mixed-citation>
</ref>
<ref id="j_jds999_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Fanaee-T</surname> <given-names>H</given-names></string-name>, <string-name><surname>Gama</surname> <given-names>J</given-names></string-name> (<year>2014</year>). <article-title>Event labeling combining ensemble detectors and background knowledge</article-title>. <source>Progress in Artificial Intelligence</source>, <volume>2</volume>: <fpage>113</fpage>–<lpage>127</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Fithian</surname> <given-names>W</given-names></string-name>, <string-name><surname>Hastie</surname> <given-names>T</given-names></string-name> (<year>2014</year>). <article-title>Local case-control sampling: Efficient subsampling in imbalanced data sets</article-title>. <source>Annals of statistics</source>, <volume>42</volume>(<issue>5</issue>): <fpage>1693</fpage>–<lpage>1724</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Han</surname> <given-names>L</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>T</given-names></string-name> (<year>2020</year>). <article-title>Local uncertainty sampling for large-scale multiclass logistic regression</article-title>. <source>Annals of Statistics</source>, <volume>48</volume>(<issue>3</issue>): <fpage>1770</fpage>–<lpage>1788</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Koenker</surname> <given-names>R</given-names></string-name> (2020). quantreg: Quantile Regression. R package version 5.55.</mixed-citation>
</ref>
<ref id="j_jds999_ref_011">
<mixed-citation publication-type="journal"> <string-name><surname>Lin</surname> <given-names>N</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>R</given-names></string-name> (<year>2011</year>). <article-title>Aggregated estimating equation estimation</article-title>. <source>Statistics and Its Interface</source>, <volume>4</volume>: <fpage>73</fpage>–<lpage>83</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>Lumley</surname> <given-names>T</given-names></string-name> (2020). survey: Analysis of Complex Survey Samples. R package version 4.0.</mixed-citation>
</ref>
<ref id="j_jds999_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>B</given-names></string-name> (<year>2015</year>). <article-title>A statistical perspective on algorithmic leveraging</article-title>. <source>Journal of Machine Learning Research</source>, <volume>16</volume>(<issue>1</issue>): <fpage>861</fpage>–<lpage>911</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_014">
<mixed-citation publication-type="journal"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>X</given-names></string-name> (<year>2015</year>). <article-title>Leveraging for big data regression</article-title>. <source>Wiley Interdisciplinary Reviews: Computational Statistics</source>, <volume>7</volume>(<issue>1</issue>): <fpage>70</fpage>–<lpage>76</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_015">
<mixed-citation publication-type="chapter"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>X</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>J</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>M</given-names></string-name> (<year>2020</year>). <chapter-title>Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms</chapter-title>. In: <source>Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics</source> (<string-name><given-names>S</given-names> <surname>Chiappa</surname></string-name>, <string-name><given-names>R</given-names> <surname>Calandra</surname></string-name>, eds.), volume <volume>108</volume> of <series><italic>Proceedings of Machine Learning Research</italic></series>, <fpage>1026</fpage>–<lpage>1035</lpage>. <publisher-name>PMLR, Online</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name> (<year>2011</year>). <article-title>Randomized algorithms for matrices and data</article-title>. <source><italic>Foundations and Trends</italic>® <italic>in Machine Learning</italic></source>, <volume>3</volume>(<issue>2</issue>): <fpage>123</fpage>–<lpage>224</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Portnoy</surname> <given-names>S</given-names></string-name>, <string-name><surname>Koenker</surname> <given-names>R</given-names></string-name>, <etal>et al.</etal> (<year>1997</year>). <article-title>The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators</article-title>. <source>Statistical Science</source>, <volume>12</volume>(<issue>4</issue>): <fpage>279</fpage>–<lpage>300</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Pronzato</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2021</year>). <article-title>Sequential online subsampling for thinning experimental designs</article-title>. <source>Journal of Statistical Planning and Inference</source>, <volume>212</volume>: <fpage>169</fpage>–<lpage>193</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_019">
<mixed-citation publication-type="book"> <collab>R Core Team</collab> (<year>2020</year>). <source>R: A Language and Environment for Statistical Computing</source>. <publisher-name>R Foundation for Statistical Computing</publisher-name>, <publisher-loc>Vienna, Austria</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Schifano</surname> <given-names>ED</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>MH</given-names></string-name> (<year>2016</year>). <article-title>Online updating of statistical inference in the big data setting</article-title>. <source>Technometrics</source>, <volume>58</volume>(<issue>3</issue>): <fpage>393</fpage>–<lpage>403</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_021">
<mixed-citation publication-type="journal"> <string-name><surname>Toulis</surname> <given-names>P</given-names></string-name>, <string-name><surname>Airoldi</surname> <given-names>EM</given-names></string-name>, <etal>et al.</etal> (<year>2017</year>). <article-title>Asymptotic and finite-sample properties of estimators based on stochastic gradients</article-title>. <source>Annals of Statistics</source>, <volume>45</volume>(<issue>4</issue>): <fpage>1694</fpage>–<lpage>1727</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_022">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2019</year>a). <article-title>Divide-and-conquer information-based optimal subdata selection algorithm</article-title>. <source>Journal of Statistical Theory and Practice</source>, <volume>13</volume>(<issue>3</issue>): <fpage>1</fpage>–<lpage>19</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2019</year>b). <article-title>More efficient estimation for logistic regression with optimal subsamples</article-title>. <source>Journal of Machine Learning Research</source>, <volume>20</volume>(<issue>132</issue>): <fpage>1</fpage>–<lpage>59</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_024">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>Y</given-names></string-name> (<year>2020</year>). <article-title>Optimal subsampling for quantile regression in big data</article-title>. <source>Biometrika</source>, <comment>in press. Forthcoming</comment>, <uri>https://doi.org/10.1093/biomet/asaa043</uri>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_025">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Stufken</surname> <given-names>J</given-names></string-name> (<year>2019</year>). <article-title>Information-based optimal subdata selection for big data linear regression</article-title>. <source>Journal of the American Statistical Association</source>, <volume>114</volume>(<issue>525</issue>): <fpage>393</fpage>–<lpage>405</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_026">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>P</given-names></string-name> (<year>2018</year>). <article-title>Optimal subsampling for large sample logistic regression</article-title>. <source>Journal of the American Statistical Association</source>, <volume>113</volume>(<issue>522</issue>): <fpage>829</fpage>–<lpage>844</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_027">
<mixed-citation publication-type="journal"> <string-name><surname>Yao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2018</year>). <article-title>Optimal subsampling for softmax regression</article-title>. <source>Statistical Papers</source>, <volume>60</volume>: <fpage>585</fpage>–<lpage>599</lpage>.</mixed-citation>
</ref>
<ref id="j_jds999_ref_028">
<mixed-citation publication-type="journal"> <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name> (<year>2020</year>). <article-title>Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data</article-title>. <source>Journal of the American Statistical Association.</source> <comment>Forthcoming</comment>, <uri>https://doi.org/10.1080/01621459.2020.1773832</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
