Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

School of Statistics, Renmin University of China

Article JDS1156

DOI: 10.6339/24-JDS1156

Computing in Data Science

Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior

Mingxuan Zhang¹, Yan Sun², and Faming Liang¹,∗

¹Department of Statistics, Purdue University, West Lafayette, IN 47907, USA

²Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA

∗Corresponding author. Email: fmliang@purdue.edu.

2026, 24(1): 218–238

26 November 2024

Supplementary Material

The supplementary material includes (i) a brief description of the prior annealing algorithm, (ii) detailed experimental settings, and (iii) a folder (code) that contains all the code for the proposed algorithm, MGPP, as well as the code to reproduce the experiments.

Article history: 11 July 2024; 6 October 2024

2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model’s expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.
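The two ingredients named in the abstract, a mixture Gaussian prior used as a regularizer and magnitude-based pruning, can be illustrated with a minimal sketch. It assumes a two-component prior of the form (1 − λ) N(0, σ0²) + λ N(0, σ1²) with σ0 much smaller than σ1; the abstract does not state the prior's exact form, so this form, the function names, and the default values of `lam`, `sigma0`, and `sigma1` are all hypothetical. This is not the authors' MGPP implementation, which is in the supplementary code folder.

```python
import math


def normal_pdf(w, sigma):
    """Density of N(0, sigma^2) evaluated at w."""
    return math.exp(-0.5 * (w / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mgp_neg_log_prior(w, lam=0.1, sigma0=0.01, sigma1=1.0):
    """Negative log of the assumed prior (1-lam)*N(0,sigma0^2) + lam*N(0,sigma1^2).

    Numerically naive: for very large |w| both densities underflow to zero.
    A practical version would work in log space.
    """
    density = (1.0 - lam) * normal_pdf(w, sigma0) + lam * normal_pdf(w, sigma1)
    return -math.log(density)


def mgp_penalty_grad(w, lam=0.1, sigma0=0.01, sigma1=1.0):
    """Derivative of the negative log prior with respect to w.

    Uses d/dw N(w; 0, s^2) = -(w / s^2) * N(w; 0, s^2), so the gradient is a
    density-weighted average of w/sigma0^2 and w/sigma1^2.
    """
    narrow = (1.0 - lam) * normal_pdf(w, sigma0)
    wide = lam * normal_pdf(w, sigma1)
    return (narrow * w / sigma0 ** 2 + wide * w / sigma1 ** 2) / (narrow + wide)


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Under these assumptions, the penalty gradient is steep for weights near zero (close to w/σ0²) and mild for large weights (close to w/σ1²). Small weights are therefore pushed toward zero while large ones are barely shrunk, which is the property that makes a subsequent magnitude-based cut less damaging.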

Keywords: consistency; large language model; sparsity; stochastic transformer; transformer

Liang’s research is supported in part by the NSF grants DMS-2015498 and DMS-2210819, and the NIH grant R01-GM152717.
