Journal of Data Science


Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior
Mingxuan Zhang, Yan Sun, Faming Liang

https://doi.org/10.6339/24-JDS1156
Pub. online: 26 November 2024    Type: Computing in Data Science    Open Access

Received: 11 July 2024
Accepted: 6 October 2024
Published: 26 November 2024

Abstract

Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model’s expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.
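
As a brief aside for readers unfamiliar with the regularizer, a minimal sketch of a two-component mixture Gaussian prior is
\[
\pi(w) = (1-\lambda)\, N(w; 0, \sigma_0^2) + \lambda\, N(w; 0, \sigma_1^2), \qquad \sigma_0^2 \ll \sigma_1^2,
\]
where the mixing proportion $\lambda$ and the variances $\sigma_0^2, \sigma_1^2$ are illustrative symbols; the exact parameterization used by MGPP may differ. Under such a prior, weights absorbed by the narrow "spike" component $N(0, \sigma_0^2)$ are shrunk toward zero during regularized training and can be removed by magnitude thresholding, while weights assigned to the wide "slab" component $N(0, \sigma_1^2)$ remain free to carry the model's expressive capacity.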

Supplementary material

The supplementary material includes (i) a brief description of the prior annealing algorithm, (ii) detailed experimental settings, and (iii) a folder (code) containing the code for the proposed MGPP algorithm as well as the code to reproduce the experiments.



Copyright
2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
consistency, large language model, sparsity, stochastic transformer, transformer

Funding
Liang’s research is supported in part by the NSF grants DMS-2015498 and DMS-2210819, and the NIH grant R01-GM152717.
