Rethinking Attention Weights as Bidirectional Coefficients
Pub. online: 14 November 2024
Type: Statistical Data Science
Open Access
Received
4 November 2023
Accepted
16 April 2024
Published
14 November 2024
Abstract
The attention mechanism has become an almost ubiquitous architecture in deep learning. One of its distinctive features is the computation of a non-negative probability distribution to re-weight input representations. This work reconsiders attention weights as bidirectional coefficients, rather than probabilistic measures, for potential benefits in interpretability and representational capacity. After analyzing how attention scores are updated through backward gradient propagation, we propose a novel activation function, TanhMax, which possesses several favorable properties that satisfy the requirements of bidirectional attention. We conduct a battery of experiments to validate our analyses and the advantages of the proposed method on both text and image datasets. The results show that bidirectional attention is effective in revealing the semantics of input units, presenting more interpretable explanations, and increasing the expressive power of attention-based models.
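To make the contrast concrete, the sketch below compares standard softmax attention weights with a signed re-weighting built from an elementwise tanh. The `signed_tanh_weights` function is only a hypothetical illustration of bidirectional (signed) coefficients, not the paper's exact TanhMax definition, which is given in the main text and the accompanying code repository.

```python
import numpy as np

def softmax(scores):
    # Standard softmax over the last axis: non-negative weights summing to 1.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def signed_tanh_weights(scores, eps=1e-9):
    # Hypothetical stand-in for a bidirectional activation: elementwise tanh
    # preserves the sign of each score, and dividing by the sum of absolute
    # values bounds the total magnitude per query. This is NOT the paper's
    # exact TanhMax; it only illustrates signed re-weighting.
    t = np.tanh(scores)
    return t / (np.abs(t).sum(axis=-1, keepdims=True) + eps)

def attention(Q, K, V, weight_fn):
    # Scaled dot-product attention with a pluggable weighting function.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = weight_fn(scores)   # shape: (n_queries, n_keys)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))

_, w_soft = attention(Q, K, V, softmax)                # all weights >= 0
_, w_bidir = attention(Q, K, V, signed_tanh_weights)   # weights may be negative
print(w_soft.min() >= 0, w_bidir.min())
```

Running the snippet shows that softmax weights are always non-negative, whereas the tanh-based weights can take either sign, which is the property the abstract refers to as bidirectionality.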
Supplementary material
The supplementary materials include: proofs of the propositions, descriptions of the activation functions, detailed experiment settings, and additional experiment results. The Python code used in the experiment section is also available on GitHub at https://github.com/BruceHYX/bidirectional_attention.