Rethinking Attention Weights as Bidirectional Coefficients
Pub. online: 14 November 2024
Type: Statistical Data Science
Open Access
Received
4 November 2023
Accepted
16 April 2024
Published
14 November 2024
Abstract
The attention mechanism has become an almost ubiquitous architecture in deep learning. One of its distinctive features is the computation of a non-negative probability distribution to re-weight input representations. This work reconsiders attention weights as bidirectional coefficients, rather than probabilistic measures, for potential benefits in interpretability and representational capacity. After analyzing how attention scores are updated through backward gradient propagation, we propose a novel activation function, TanhMax, which possesses several favorable properties that satisfy the requirements of bidirectional attention. We conduct a battery of experiments to validate our analyses and the advantages of the proposed method on both text and image datasets. The results show that bidirectional attention is effective in revealing the semantics of input units, presenting more interpretable explanations, and increasing the expressive power of attention-based models.
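To make the contrast concrete, the sketch below compares standard softmax attention weights with a signed re-weighting built from an elementwise tanh. The `signed_tanh_weights` function is only a hypothetical illustration of bidirectional (signed) coefficients, not the paper's exact TanhMax definition, which is given in the main text and the accompanying code repository.

```python
import numpy as np

def softmax(scores):
    # Standard softmax over the last axis: non-negative weights summing to 1.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def signed_tanh_weights(scores, eps=1e-9):
    # Hypothetical stand-in for a bidirectional activation: elementwise tanh
    # preserves the sign of each score, and dividing by the sum of absolute
    # values bounds the total magnitude per query. This is NOT the paper's
    # exact TanhMax; it only illustrates signed re-weighting.
    t = np.tanh(scores)
    return t / (np.abs(t).sum(axis=-1, keepdims=True) + eps)

def attention(Q, K, V, weight_fn):
    # Scaled dot-product attention with a pluggable weighting function.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = weight_fn(scores)   # shape: (n_queries, n_keys)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))

_, w_soft = attention(Q, K, V, softmax)                # all weights >= 0
_, w_bidir = attention(Q, K, V, signed_tanh_weights)   # weights may be negative
print(w_soft.min() >= 0, w_bidir.min())
```

Running the snippet shows that softmax weights are always non-negative, whereas the tanh-based weights can take either sign, which is the property the abstract refers to as bidirectionality.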
Supplementary material
The supplementary materials include: proofs of the propositions, descriptions of the activation functions, detailed experiment settings, and additional experiment results. The Python code used in the experiment section is also available on GitHub at https://github.com/BruceHYX/bidirectional_attention.