<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1156</article-id>
<article-id pub-id-type="doi">10.6339/24-JDS1156</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Computing in Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhang</surname><given-names>Mingxuan</given-names></name><xref ref-type="aff" rid="j_jds1156_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Sun</surname><given-names>Yan</given-names></name><xref ref-type="aff" rid="j_jds1156_aff_002">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Liang</surname><given-names>Faming</given-names></name><email xlink:href="mailto:fmliang@purdue.edu">fmliang@purdue.edu</email><xref ref-type="aff" rid="j_jds1156_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1156_aff_001"><label>1</label>Department of Statistics, <institution>Purdue University</institution>, West Lafayette, IN 47907, USA</aff>
<aff id="j_jds1156_aff_002"><label>2</label>Department of Biostatistics, Epidemiology, and Informatics, <institution>University of Pennsylvania</institution>, Philadelphia, PA 19104, USA</aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:fmliang@purdue.edu">fmliang@purdue.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2026</year></pub-date><pub-date pub-type="epub"><day>26</day><month>11</month><year>2024</year></pub-date><volume>24</volume><issue>1</issue><fpage>218</fpage><lpage>238</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1156_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The supplementary material includes (i) a brief description of the prior annealing algorithm, (ii) detailed experimental settings, and (iii) a folder (code) that contains all the code for the proposed algorithm <monospace>MGPP</monospace> as well as the code to reproduce the experiments.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>11</day><month>7</month><year>2024</year></date><date date-type="accepted"><day>6</day><month>10</month><year>2024</year></date></history>
<permissions><copyright-statement>2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model’s expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>consistency</kwd>
<kwd>large language model</kwd>
<kwd>sparsity</kwd>
<kwd>stochastic transformer</kwd>
<kwd>transformer</kwd>
</kwd-group>
<funding-group><funding-statement>Liang’s research is supported in part by NSF grants DMS-2015498 and DMS-2210819 and NIH grant R01-GM152717.</funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1156_reflist_001">
<title>References</title>
<ref id="j_jds1156_ref_001">
<mixed-citation publication-type="chapter"> <string-name><surname>Brown</surname> <given-names>T</given-names></string-name>, <string-name><surname>Mann</surname> <given-names>B</given-names></string-name>, <string-name><surname>Ryder</surname> <given-names>N</given-names></string-name>, <string-name><surname>Subbiah</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kaplan</surname> <given-names>JD</given-names></string-name>, <string-name><surname>Dhariwal</surname> <given-names>P</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <chapter-title>Language models are few-shot learners</chapter-title>. In: <string-name>Larochelle H</string-name>, <string-name>Ranzato M</string-name>, <string-name>Hadsell R</string-name>, <string-name>Balcan M-F</string-name>, <string-name>Lin H-T</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 33</italic></source>: <fpage>1877</fpage>–<lpage>1901</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_002">
<mixed-citation publication-type="other"> <string-name><surname>Cer</surname> <given-names>D</given-names></string-name>, <string-name><surname>Diab</surname> <given-names>M</given-names></string-name>, <string-name><surname>Agirre</surname> <given-names>E</given-names></string-name>, <string-name><surname>Lopez-Gazpio</surname> <given-names>I</given-names></string-name>, <string-name><surname>Specia</surname> <given-names>L</given-names></string-name> (<year>2017</year>). Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint: <uri>https://arxiv.org/abs/1708.00055</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Chen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Frankle</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Z</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <chapter-title>The lottery ticket hypothesis for pre-trained bert networks</chapter-title>. In: <string-name>Larochelle H</string-name>, <string-name>Ranzato M</string-name>, <string-name>Hadsell R</string-name>, <string-name>Balcan M-F</string-name>, <string-name>Lin H-T</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 33</italic></source>: <fpage>15834</fpage>–<lpage>15846</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_004">
<mixed-citation publication-type="chapter"> <string-name><surname>Dagan</surname> <given-names>I</given-names></string-name>, <string-name><surname>Glickman</surname> <given-names>O</given-names></string-name>, <string-name><surname>Magnini</surname> <given-names>B</given-names></string-name> (<year>2006</year>). <chapter-title>The Pascal recognising textual entailment challenge</chapter-title>. In: <string-name><surname>Quiñonero-Candela</surname> <given-names>J</given-names></string-name>, <string-name><surname>Dagan</surname> <given-names>I</given-names></string-name>, <string-name><surname>Magnini</surname> <given-names>B</given-names></string-name>, <string-name><surname>d’Alché-Buc</surname> <given-names>F</given-names></string-name> (eds) <source><italic>Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment</italic></source>, <fpage>177</fpage>–<lpage>190</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_005">
<mixed-citation publication-type="other"> <string-name><surname>Devlin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Toutanova</surname> <given-names>K</given-names></string-name> (<year>2019</year>). Bert: Pre-training of deep bidirectional transformers for language understanding.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_006">
<mixed-citation publication-type="chapter"> <string-name><surname>Ding</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Guo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Han</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <chapter-title>Global sparse momentum sgd for pruning very deep neural networks</chapter-title>. In: <string-name>Wallach HM</string-name>, <string-name>Larochelle H</string-name>, <string-name>Beygelzimer A</string-name>, <string-name>d’Alché-Buc F</string-name>, <string-name>Fox EB</string-name>, <string-name>Garnett R</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems, 32</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_007">
<mixed-citation publication-type="chapter"> <string-name><surname>Dolan</surname> <given-names>B</given-names></string-name>, <string-name><surname>Brockett</surname> <given-names>C</given-names></string-name> (<year>2005</year>). <chapter-title>Automatically constructing a corpus of sentential paraphrases</chapter-title>. In: <source><italic>Third International Workshop on Paraphrasing (IWP2005)</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_008">
<mixed-citation publication-type="other"> <string-name><surname>Frankle</surname> <given-names>J</given-names></string-name>, <string-name><surname>Carbin</surname> <given-names>M</given-names></string-name> (<year>2018</year>). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint: <uri>https://arxiv.org/abs/1803.03635</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_009">
<mixed-citation publication-type="chapter"> <string-name><surname>Frantar</surname> <given-names>E</given-names></string-name>, <string-name><surname>Kurtic</surname> <given-names>E</given-names></string-name>, <string-name><surname>Alistarh</surname> <given-names>D</given-names></string-name> (<year>2021</year>). <chapter-title>M-fac: Efficient matrix-free approximations of second-order information</chapter-title>. In: <string-name>Ranzato M</string-name>, <string-name>Beygelzimer A</string-name>, <string-name>Dauphin YN</string-name>, <string-name>Liang P</string-name>, <string-name>Vaughan JW</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 34</italic></source>: <fpage>14873</fpage>–<lpage>14886</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Han</surname> <given-names>S</given-names></string-name>, <string-name><surname>Mao</surname> <given-names>H</given-names></string-name>, <string-name><surname>Dally</surname> <given-names>WJ</given-names></string-name> (<year>2015</year>a). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint: <uri>https://arxiv.org/abs/1510.00149</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_011">
<mixed-citation publication-type="chapter"> <string-name><surname>Han</surname> <given-names>S</given-names></string-name>, <string-name><surname>Pool</surname> <given-names>J</given-names></string-name>, <string-name><surname>Tran</surname> <given-names>J</given-names></string-name>, <string-name><surname>Dally</surname> <given-names>W</given-names></string-name> (<year>2015</year>b). <chapter-title>Learning both weights and connections for efficient neural network</chapter-title>. In: <string-name>Cortes C</string-name>, <string-name>Lawrence ND</string-name>, <string-name>Lee DD</string-name>, <string-name>Sugiyama M</string-name>, <string-name>Garnett R</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems, 28</italic></source>: <fpage>1135</fpage>–<lpage>1143</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>He</surname> <given-names>P</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>W</given-names></string-name> (<year>2021</year>). Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint: <uri>https://arxiv.org/abs/2111.09543</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_013">
<mixed-citation publication-type="chapter"> <string-name><surname>Hermann</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Kocisky</surname> <given-names>T</given-names></string-name>, <string-name><surname>Grefenstette</surname> <given-names>E</given-names></string-name>, <string-name><surname>Espeholt</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kay</surname> <given-names>W</given-names></string-name>, <string-name><surname>Suleyman</surname> <given-names>M</given-names></string-name>, <etal>et al.</etal> (<year>2015</year>). <chapter-title>Teaching machines to read and comprehend</chapter-title>. In: <string-name>Cortes C</string-name>, <string-name>Lawrence ND</string-name>, <string-name>Lee DD</string-name>, <string-name>Sugiyama M</string-name>, <string-name>Garnett R</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems, 28</italic></source>: <fpage>1693</fpage>–<lpage>1701</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_014">
<mixed-citation publication-type="other"> <string-name><surname>Kim</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2024</year>). Narrow and deep neural networks achieve feature learning consistency.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_015">
<mixed-citation publication-type="other"> <string-name><surname>Kurtic</surname> <given-names>E</given-names></string-name>, <string-name><surname>Campos</surname> <given-names>D</given-names></string-name>, <string-name><surname>Nguyen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Frantar</surname> <given-names>E</given-names></string-name>, <string-name><surname>Kurtz</surname> <given-names>M</given-names></string-name>, <string-name><surname>Fineran</surname> <given-names>B</given-names></string-name>, <etal>et al.</etal> (<year>2022</year>). The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint: <uri>https://arxiv.org/abs/2203.07259</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>LeCun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Denker</surname> <given-names>J</given-names></string-name>, <string-name><surname>Solla</surname> <given-names>S</given-names></string-name> (<year>1989</year>). <chapter-title>Optimal brain damage</chapter-title>. In: <string-name>Touretzky DS</string-name> (ed) <source><italic>Advances in Neural Information Processing Systems, 2</italic></source>: <fpage>598</fpage>–<lpage>605</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_017">
<mixed-citation publication-type="other"> <string-name><surname>Lee</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ajanthan</surname> <given-names>T</given-names></string-name>, <string-name><surname>Torr</surname> <given-names>PH</given-names></string-name> (<year>2018</year>). Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint: <uri>https://arxiv.org/abs/1810.02340</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_018">
<mixed-citation publication-type="chapter"> <string-name><surname>Levesque</surname> <given-names>H</given-names></string-name>, <string-name><surname>Davis</surname> <given-names>E</given-names></string-name>, <string-name><surname>Morgenstern</surname> <given-names>L</given-names></string-name> (<year>2012</year>). <chapter-title>The winograd schema challenge</chapter-title>. In: <string-name>Brewka G</string-name>, <string-name>Eiter T</string-name>, <string-name>McIlraith SA</string-name> (eds) <source><italic>Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_019">
<mixed-citation publication-type="other"> <string-name><surname>Lewis</surname> <given-names>M</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Goyal</surname> <given-names>N</given-names></string-name>, <string-name><surname>Ghazvininejad</surname> <given-names>M</given-names></string-name>, <string-name><surname>Mohamed</surname> <given-names>A</given-names></string-name>, <string-name><surname>Levy</surname> <given-names>O</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint: <uri>https://arxiv.org/abs/1910.13461</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_020">
<mixed-citation publication-type="other"> <string-name><surname>Li</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>C</given-names></string-name>, <string-name><surname>He</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>W</given-names></string-name>, <etal>et al.</etal> (<year>2023</year>). Losparse: Structured compression of large language models based on low-rank and sparse approximation. arXiv preprint: <uri>https://arxiv.org/abs/2306.11222</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_021">
<mixed-citation publication-type="other"> <string-name><surname>Liang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>M</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>He</surname> <given-names>P</given-names></string-name>, <etal>et al.</etal> (<year>2021</year>). Super tickets in pre-trained language models: From model compression to improving generalization. arXiv preprint: <uri>https://arxiv.org/abs/2105.12002</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_022">
<mixed-citation publication-type="journal"> <string-name><surname>Liang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Jia</surname> <given-names>B</given-names></string-name>, <string-name><surname>Xue</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Luo</surname> <given-names>Y</given-names></string-name> (<year>2018</year>a). <article-title>An imputation-regularized optimization algorithm for high-dimensional missing data problems and beyond</article-title>. <source><italic>Journal of the Royal Statistical Society, Series B</italic></source>, <volume>80</volume>(<issue>5</issue>): <fpage>899</fpage>–<lpage>926</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/rssb.12279" xlink:type="simple">https://doi.org/10.1111/rssb.12279</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_023">
<mixed-citation publication-type="journal"> <string-name><surname>Liang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Li</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>L</given-names></string-name> (<year>2018</year>b). <article-title>Bayesian neural networks for selection of drug sensitive genes</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>113</volume>(<issue>523</issue>): <fpage>955</fpage>–<lpage>972</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2017.1409122" xlink:type="simple">https://doi.org/10.1080/01621459.2017.1409122</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_024">
<mixed-citation publication-type="chapter"> <string-name><surname>Liang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2022</year>). <chapter-title>Nonlinear sufficient dimension reduction with a stochastic neural network</chapter-title>. In: <string-name>Koyejo S</string-name>, <string-name>Mohamed S</string-name>, <string-name>Agarwal A</string-name>, <string-name>Belgrave D</string-name>, <string-name>Cho K</string-name>, <string-name>Oh A</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 35</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_025">
<mixed-citation publication-type="chapter"> <string-name><surname>Lin</surname> <given-names>CY</given-names></string-name> (<year>2004</year>). <chapter-title>Rouge: A package for automatic evaluation of summaries</chapter-title>. In: <source><italic>Text Summarization Branches Out</italic></source>, <fpage>74</fpage>–<lpage>81</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_026">
<mixed-citation publication-type="other"> <string-name><surname>Loshchilov</surname> <given-names>I</given-names></string-name>, <string-name><surname>Hutter</surname> <given-names>F</given-names></string-name> (<year>2019</year>). Decoupled weight decay regularization.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_027">
<mixed-citation publication-type="other"> <string-name><surname>Louizos</surname> <given-names>C</given-names></string-name>, <string-name><surname>Welling</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kingma</surname> <given-names>DP</given-names></string-name> (<year>2017</year>). Learning sparse neural networks through <inline-formula id="j_jds1156_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mtext>_</mml:mtext>
<mml:mn>0</mml:mn></mml:math><tex-math><![CDATA[$l\text{\_}0$]]></tex-math></alternatives></inline-formula> regularization. arXiv preprint: <uri>https://arxiv.org/abs/1712.01312</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_028">
<mixed-citation publication-type="chapter"> <string-name><surname>Molchanov</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mallya</surname> <given-names>A</given-names></string-name>, <string-name><surname>Tyree</surname> <given-names>S</given-names></string-name>, <string-name><surname>Frosio</surname> <given-names>I</given-names></string-name>, <string-name><surname>Kautz</surname> <given-names>J</given-names></string-name> (<year>2019</year>). <chapter-title>Importance estimation for neural network pruning</chapter-title>. In: <source><italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic></source>, <fpage>11264</fpage>–<lpage>11272</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_029">
<mixed-citation publication-type="other"> <string-name><surname>Narayan</surname> <given-names>S</given-names></string-name>, <string-name><surname>Cohen</surname> <given-names>SB</given-names></string-name>, <string-name><surname>Lapata</surname> <given-names>M</given-names></string-name> (<year>2018</year>). Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint: <uri>https://arxiv.org/abs/1808.08745</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_030">
<mixed-citation publication-type="journal"> <string-name><surname>Portnoy</surname> <given-names>S</given-names></string-name> (<year>1988</year>). <article-title>Asymptotic behavior of likelihood methods for exponential families when the number of parameters tend to infinity</article-title>. <source><italic>The Annals of Statistics</italic></source>, <volume>16</volume>(<issue>1</issue>): <fpage>356</fpage>–<lpage>366</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/aos/1176350710" xlink:type="simple">https://doi.org/10.1214/aos/1176350710</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_031">
<mixed-citation publication-type="journal"> <string-name><surname>Radford</surname> <given-names>A</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Child</surname> <given-names>R</given-names></string-name>, <string-name><surname>Luan</surname> <given-names>D</given-names></string-name>, <string-name><surname>Amodei</surname> <given-names>D</given-names></string-name>, <string-name><surname>Sutskever</surname> <given-names>I</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <article-title>Language models are unsupervised multitask learners</article-title>. <source><italic>OpenAI blog</italic></source>, <volume>1</volume>(<issue>8</issue>): <fpage>9</fpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_032">
<mixed-citation publication-type="other"> <string-name><surname>Rajpurkar</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lopyrev</surname> <given-names>K</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>P</given-names></string-name> (<year>2016</year>). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint: <uri>https://arxiv.org/abs/1606.05250</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_033">
<mixed-citation publication-type="chapter"> <string-name><surname>Sanh</surname> <given-names>V</given-names></string-name>, <string-name><surname>Wolf</surname> <given-names>T</given-names></string-name>, <string-name><surname>Rush</surname> <given-names>A</given-names></string-name> (<year>2020</year>). <chapter-title>Movement pruning: Adaptive sparsity by fine-tuning</chapter-title>. In: <string-name>Larochelle H</string-name>, <string-name>Ranzato M</string-name>, <string-name>Hadsell R</string-name>, <string-name>Balcan M-F</string-name>, <string-name>Lin H-T</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 33</italic></source>: <fpage>20378</fpage>–<lpage>20389</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_034">
<mixed-citation publication-type="chapter"> <string-name><surname>Singh</surname> <given-names>SP</given-names></string-name>, <string-name><surname>Alistarh</surname> <given-names>D</given-names></string-name> (<year>2020</year>). <chapter-title>Woodfisher: Efficient second-order approximation for neural network compression</chapter-title>. In: <string-name>Larochelle H</string-name>, <string-name>Ranzato M</string-name>, <string-name>Hadsell R</string-name>, <string-name>Balcan M-F</string-name>, <string-name>Lin H-T</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 33</italic></source>: <fpage>18098</fpage>–<lpage>18109</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_035">
<mixed-citation publication-type="chapter"> <string-name><surname>Socher</surname> <given-names>R</given-names></string-name>, <string-name><surname>Perelygin</surname> <given-names>A</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chuang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Manning</surname> <given-names>CD</given-names></string-name>, <string-name><surname>Ng</surname> <given-names>AY</given-names></string-name>, <etal>et al.</etal> (<year>2013</year>). <chapter-title>Recursive deep models for semantic compositionality over a sentiment treebank</chapter-title>. In: <source><italic>Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</italic></source>, <fpage>1631</fpage>–<lpage>1642</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_036">
<mixed-citation publication-type="journal"> <string-name><surname>Song</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2022</year>). <article-title>Nearly optimal Bayesian shrinkage for high-dimensional regression</article-title>. <source><italic>Science China Mathematics</italic></source>, <volume>66</volume>: <fpage>409</fpage>–<lpage>442</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s11425-020-1912-6" xlink:type="simple">https://doi.org/10.1007/s11425-020-1912-6</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_037">
<mixed-citation publication-type="other"> <string-name><surname>Strubell</surname> <given-names>E</given-names></string-name>, <string-name><surname>Ganesh</surname> <given-names>A</given-names></string-name>, <string-name><surname>McCallum</surname> <given-names>A</given-names></string-name> (<year>2020</year>). Energy and policy considerations for deep learning in nlp. arXiv preprint: <uri>https://arxiv.org/abs/1906.02243</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_038">
<mixed-citation publication-type="journal"> <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2022</year>). <article-title>A kernel-expanded stochastic neural network</article-title>. <source><italic>Journal of the Royal Statistical Society Series B</italic></source>, <volume>84</volume>(<issue>2</issue>): <fpage>547</fpage>–<lpage>578</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/rssb.12496" xlink:type="simple">https://doi.org/10.1111/rssb.12496</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_039">
<mixed-citation publication-type="journal"> <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2022</year>a). <article-title>Consistent sparse deep learning: Theory and computation</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>117</volume>: <fpage>1981</fpage>–<lpage>1995</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2021.1895175" xlink:type="simple">https://doi.org/10.1080/01621459.2021.1895175</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_040">
<mixed-citation publication-type="journal"> <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Song</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2022</year>b). <article-title>Learning sparse deep neural networks with a spike-and-slab prior</article-title>. <source><italic>Statistics &amp; Probability Letters</italic></source>, <volume>180</volume>: <fpage>109246</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.spl.2021.109246" xlink:type="simple">https://doi.org/10.1016/j.spl.2021.109246</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_041">
<mixed-citation publication-type="journal"> <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2021</year>). <article-title>Sparse deep learning: A new framework immune to local traps and miscalibration</article-title>. In: <string-name>Ranzato M</string-name>, <string-name>Beygelzimer A</string-name>, <string-name>Dauphin YN</string-name>, <string-name>Liang P</string-name>, <string-name>Vaughan JW</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 34</italic></source>: <fpage>22301</fpage>–<lpage>22312</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_042">
<mixed-citation publication-type="other"> <string-name><surname>Thickstun</surname> <given-names>J</given-names></string-name> (<year>2020</year>). The transformer model in equations. <uri>https://johnthickstun.com/docs/transformers.pdf</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_043">
<mixed-citation publication-type="other"> <string-name><surname>Touvron</surname> <given-names>H</given-names></string-name>, <string-name><surname>Martin</surname> <given-names>L</given-names></string-name>, <string-name><surname>Stone</surname> <given-names>K</given-names></string-name>, <string-name><surname>Albert</surname> <given-names>P</given-names></string-name>, <string-name><surname>Almahairi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Babaei</surname> <given-names>Y</given-names></string-name>, <etal>et al.</etal> (<year>2023</year>). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint: <uri>https://arxiv.org/abs/2307.09288</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_044">
<mixed-citation publication-type="other"> <string-name><surname>Wang</surname> <given-names>A</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Michael</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hill</surname> <given-names>F</given-names></string-name>, <string-name><surname>Levy</surname> <given-names>O</given-names></string-name>, <string-name><surname>Bowman</surname> <given-names>SR</given-names></string-name> (<year>2018</year>). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint: <uri>https://arxiv.org/abs/1804.07461</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_045">
<mixed-citation publication-type="other"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Qin</surname> <given-names>C</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Fu</surname> <given-names>Y</given-names></string-name> (<year>2020</year>). Neural pruning via growing regularization. arXiv preprint: <uri>https://arxiv.org/abs/2012.09243</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_046">
<mixed-citation publication-type="journal"> <string-name><surname>Warstadt</surname> <given-names>A</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bowman</surname> <given-names>SR</given-names></string-name> (<year>2019</year>). <article-title>Neural network acceptability judgments</article-title>. <source><italic>Transactions of the Association for Computational Linguistics</italic></source>, <volume>7</volume>: <fpage>625</fpage>–<lpage>641</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1162/tacl_a_00290" xlink:type="simple">https://doi.org/10.1162/tacl_a_00290</ext-link></mixed-citation>
</ref>
<ref id="j_jds1156_ref_047">
<mixed-citation publication-type="other"> <string-name><surname>Williams</surname> <given-names>A</given-names></string-name>, <string-name><surname>Nangia</surname> <given-names>N</given-names></string-name>, <string-name><surname>Bowman</surname> <given-names>SR</given-names></string-name> (<year>2017</year>). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint: <uri>https://arxiv.org/abs/1704.05426</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_048">
<mixed-citation publication-type="chapter"> <string-name><surname>Wolf</surname> <given-names>T</given-names></string-name>, <string-name><surname>Debut</surname> <given-names>L</given-names></string-name>, <string-name><surname>Sanh</surname> <given-names>V</given-names></string-name>, <string-name><surname>Chaumond</surname> <given-names>J</given-names></string-name>, <string-name><surname>Delangue</surname> <given-names>C</given-names></string-name>, <string-name><surname>Moi</surname> <given-names>A</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <chapter-title>Transformers: State-of-the-art natural language processing</chapter-title>. In: <string-name>Liu Q</string-name>, <string-name>Schlangen D</string-name> (eds) <source><italic>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</italic></source>, <fpage>38</fpage>–<lpage>45</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>, Online.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_049">
<mixed-citation publication-type="other"> <string-name><surname>Zafrir</surname> <given-names>O</given-names></string-name>, <string-name><surname>Larey</surname> <given-names>A</given-names></string-name>, <string-name><surname>Boudoukh</surname> <given-names>G</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wasserblat</surname> <given-names>M</given-names></string-name> (<year>2021</year>). Prune once for all: Sparse pre-trained language models. arXiv preprint: <uri>https://arxiv.org/abs/2111.05754</uri>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_050">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>F</given-names></string-name> (<year>2023</year>). <article-title>Sparse deep learning for time series: Theory and applications</article-title>. In: <string-name>Oh A</string-name>, <string-name>Naumann T</string-name>, <string-name>Globerson A</string-name>, <string-name>Saenko K</string-name>, <string-name>Levine S</string-name> (eds) <source><italic>Advances in Neural Information Processing Systems 36</italic></source>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_051">
<mixed-citation publication-type="chapter"> <string-name><surname>Zhang</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zuo</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>C</given-names></string-name>, <string-name><surname>Bukharin</surname> <given-names>A</given-names></string-name>, <string-name><surname>He</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>W</given-names></string-name>, <etal>et al.</etal> (<year>2022</year>). <chapter-title>PLATON: Pruning large transformer models with upper confidence bound of weight importance</chapter-title>. In: <string-name>Chaudhuri K</string-name>, <string-name>Jegelka S</string-name>, <string-name>Song L</string-name>, <string-name>Szepesvári C</string-name>, <string-name>Niu G</string-name>, <string-name>Sabato S</string-name> (eds) <source><italic>International Conference on Machine Learning</italic></source>: <fpage>26809</fpage>–<lpage>26823</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1156_ref_052">
<mixed-citation publication-type="other"> <string-name><surname>Zhu</surname> <given-names>M</given-names></string-name>, <string-name><surname>Gupta</surname> <given-names>S</given-names></string-name> (<year>2017</year>). To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint: <uri>https://arxiv.org/abs/1710.01878</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
