Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior
Pub. online: 26 November 2024
Type: Computing in Data Science
Open Access
Received: 11 July 2024
Accepted: 6 October 2024
Published: 26 November 2024
Abstract
Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model’s expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.
Supplementary material
The supplementary material includes (i) a brief description of the prior annealing algorithm, (ii) detailed experimental settings, and (iii) a code folder containing the implementation of the proposed MGPP algorithm together with the code to reproduce the experiments.
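The released code implements the full method; purely as an illustration of the idea summarized in the abstract, the sketch below regularizes weights with a two-component mixture Gaussian (spike-and-slab-style) prior and then prunes by magnitude. The function names, hyperparameter values (sigma0, sigma1, lam), and the thresholding rule are assumptions made for this example and are not taken from the paper.

import torch

def mixture_gaussian_neg_log_prior(w, sigma0=0.001, sigma1=0.1, lam=1e-4):
    # Negative log density (up to an additive constant) of the zero-mean mixture
    # lam * N(0, sigma1^2) + (1 - lam) * N(0, sigma0^2), evaluated element-wise
    # and summed over all weights. sigma0, sigma1, and lam are illustrative values.
    log_spike = torch.log1p(torch.tensor(-lam)) - 0.5 * (w / sigma0) ** 2 - torch.log(torch.tensor(sigma0))
    log_slab = torch.log(torch.tensor(lam)) - 0.5 * (w / sigma1) ** 2 - torch.log(torch.tensor(sigma1))
    return -torch.logsumexp(torch.stack([log_spike, log_slab]), dim=0).sum()

def magnitude_prune_(weight, sparsity=0.9):
    # In-place magnitude pruning: zero out the smallest-magnitude entries.
    k = int(sparsity * weight.numel())
    if k > 0:
        threshold = weight.abs().flatten().kthvalue(k).values
        weight.masked_fill_(weight.abs() <= threshold, 0.0)
    return weight

# Toy usage: one gradient step on a task loss plus the prior penalty, then prune.
w = torch.randn(256, 256, requires_grad=True)
loss = (w ** 2).mean() + mixture_gaussian_neg_log_prior(w) / w.numel()
loss.backward()
with torch.no_grad():
    w -= 1e-3 * w.grad
    magnitude_prune_(w, sparsity=0.9)

In practice such a penalty would be applied to the weight matrices of a pretrained transformer during fine-tuning, with the target sparsity reached gradually; the supplementary material contains the actual MGPP implementation and experimental settings.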