
Journal of Data Science (JDS)

ISSN 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article JDS1134

DOI: 10.6339/24-JDS1134

Statistical Data Science

Rethinking Attention Weights as Bidirectional Coefficients

Yuxiang Huang¹, Hanfang Yang¹*, Xingrui Wang²

¹School of Statistics, Renmin University of China, Beijing, China

²Whiting School of Engineering, Johns Hopkins University, Baltimore, USA

*Corresponding author. Email: hyang@ruc.edu.cn

14 November 2024

00117

Supplementary Material

The supplementary materials include proofs of the propositions, descriptions of the activation functions, detailed experimental settings, and additional experimental results. The Python code used in the experiments is also available on GitHub at https://github.com/BruceHYX/bidirectional_attention.

4 November 2023; 16 April 2024

© 2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

The attention mechanism has become an almost ubiquitous model architecture in deep learning. One of its distinctive features is that it computes a non-negative probability distribution to re-weight input representations. This work reconsiders attention weights as bidirectional coefficients rather than probabilistic measures, for potential benefits in interpretability and representational capacity. After analyzing how attention scores are updated through backward gradient propagation, we propose a novel activation function, TanhMax, which possesses several favorable properties that satisfy the requirements of bidirectional attention. We conduct a battery of experiments on both text and image datasets to validate our analysis and the advantages of the proposed method. The results show that bidirectional attention is effective in revealing the semantics of input units, presenting more interpretable explanations, and increasing the expressive power of attention-based models.

Keywords: attention mechanism; bidirectional coefficients; interpretability

This research was partially supported by the Major Project of the MOE (China) National Key Research Bases for Humanities and Social Sciences (22JJD910003).
