<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1134</article-id>
<article-id pub-id-type="doi">10.6339/24-JDS1134</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Statistical Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Rethinking Attention Weights as Bidirectional Coefficients</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Huang</surname><given-names>Yuxiang</given-names></name><xref ref-type="aff" rid="j_jds1134_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yang</surname><given-names>Hanfang</given-names></name><xref ref-type="aff" rid="j_jds1134_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>Xingrui</given-names></name><xref ref-type="aff" rid="j_jds1134_aff_002">2</xref>
</contrib>
<aff id="j_jds1134_aff_001"><label>1</label>School of Statistics, <institution>Renmin University of China</institution>, Beijing, <country>China</country></aff>
<aff id="j_jds1134_aff_002"><label>2</label>Whiting School of Engineering, <institution>Johns Hopkins University</institution>, Baltimore, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <email xlink:href="mailto:hyang@ruc.edu.cn">hyang@ruc.edu.cn</email>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2024</year></pub-date><pub-date pub-type="epub"><day>14</day><month>11</month><year>2024</year></pub-date><volume content-type="ahead-of-print">0</volume><issue>0</issue><fpage>1</fpage><lpage>17</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1134_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The supplementary materials include proofs of the propositions, descriptions of the activation functions, detailed experimental settings, and additional experimental results. The Python code used in the experiments section is also available on GitHub at <ext-link ext-link-type="uri" xlink:href="https://github.com/BruceHYX/bidirectional_attention">https://github.com/BruceHYX/bidirectional_attention</ext-link>.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>4</day><month>11</month><year>2023</year></date><date date-type="accepted"><day>16</day><month>4</month><year>2024</year></date></history>
<permissions><copyright-statement>2024 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>The attention mechanism has become an almost ubiquitous model architecture in deep learning. One of its distinctive features is that it computes a non-negative probability distribution to re-weight input representations. This work reconsiders attention weights as bidirectional coefficients instead of probabilistic measures, for potential benefits in interpretability and representational capacity. After analyzing the iteration process of attention scores through backward gradient propagation, we propose a novel activation function, TanhMax, which possesses several favorable properties that satisfy the requirements of bidirectional attention. We conduct a battery of experiments on both text and image datasets to validate our analyses and the advantages of the proposed method. The results show that bidirectional attention is effective in revealing the semantics of input units, presenting more interpretable explanations, and increasing the expressive power of attention-based models.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>attention mechanism</kwd>
<kwd>bidirectional coefficients</kwd>
<kwd>interpretability</kwd>
</kwd-group>
<funding-group><funding-statement>This research was partially supported by the Major Project of the MOE (China) National Key Research Bases for Humanities and Social Sciences (22JJD910003).</funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1134_reflist_001">
<title>References</title>
<ref id="j_jds1134_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Abramson</surname> <given-names>NM</given-names></string-name>, <string-name><surname>Braverman</surname> <given-names>DJ</given-names></string-name>, <string-name><surname>Sebestyen</surname> <given-names>GS</given-names></string-name> (<year>1963</year>). <article-title>Pattern recognition and machine learning</article-title>. <source><italic>IEEE Transactions on Information Theory</italic></source>, <volume>9</volume>(<issue>4</issue>): <fpage>257</fpage>–<lpage>261</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TIT.1963.1057854" xlink:type="simple">https://doi.org/10.1109/TIT.1963.1057854</ext-link></mixed-citation>
</ref>
<ref id="j_jds1134_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Bach</surname> <given-names>S</given-names></string-name>, <string-name><surname>Binder</surname> <given-names>A</given-names></string-name>, <string-name><surname>Montavon</surname> <given-names>G</given-names></string-name>, <string-name><surname>Klauschen</surname> <given-names>F</given-names></string-name>, <string-name><surname>Müller</surname> <given-names>KR</given-names></string-name>, <string-name><surname>Samek</surname> <given-names>W</given-names></string-name> (<year>2015</year>). <article-title>On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation</article-title>. <source><italic>PLoS ONE</italic></source>, <volume>10</volume>(<issue>7</issue>): <fpage>e0130140</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pone.0130140" xlink:type="simple">https://doi.org/10.1371/journal.pone.0130140</ext-link></mixed-citation>
</ref>
<ref id="j_jds1134_ref_003">
<mixed-citation publication-type="chapter"> <string-name><surname>Bridle</surname> <given-names>JS</given-names></string-name> (<year>1989</year>). <chapter-title>Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition</chapter-title>. In: <source><italic>Neurocomputing – Algorithms, Architectures and Applications, Proceedings of the NATO Advanced Research Workshop on Neurocomputing Algorithms, Architectures and Applications</italic></source>, <conf-loc>Les Arcs, France</conf-loc>, <conf-date>February 27–March 3, 1989</conf-date> (<string-name><given-names>F</given-names> <surname>Fogelman-Soulié</surname></string-name>, <string-name><given-names>J</given-names> <surname>Hérault</surname></string-name>, eds.), volume <volume>68</volume> of <series><italic>NATO ASI Series</italic></series>. <fpage>227</fpage>–<lpage>236</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_004">
<mixed-citation publication-type="chapter"> <string-name><surname>Choromanski</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Likhosherstov</surname> <given-names>V</given-names></string-name>, <string-name><surname>Dohan</surname> <given-names>D</given-names></string-name>, <string-name><surname>Song</surname> <given-names>X</given-names></string-name>, <string-name><surname>Gane</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sarlós</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal> (<year>2021</year>). <chapter-title>Rethinking attention with performers</chapter-title>. In: <source><italic>9th International Conference on Learning Representations, ICLR 2021</italic></source>, <comment>Virtual Event</comment>, <conf-loc>Austria</conf-loc>, <conf-date>May 3–7, 2021</conf-date> (<string-name><given-names>S</given-names> <surname>Mohamed</surname></string-name>, <string-name><given-names>K</given-names> <surname>Hofmann</surname></string-name>, <string-name><given-names>A</given-names> <surname>Oh</surname></string-name>, <string-name><given-names>N</given-names> <surname>Murray</surname></string-name>, <string-name><given-names>I</given-names> <surname>Titov</surname></string-name>, eds.), <comment>OpenReview.net</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_005">
<mixed-citation publication-type="chapter"> <string-name><surname>Dehghani</surname> <given-names>M</given-names></string-name>, <string-name><surname>Gouws</surname> <given-names>S</given-names></string-name>, <string-name><surname>Vinyals</surname> <given-names>O</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kaiser</surname> <given-names>L</given-names></string-name> (<year>2019</year>). <chapter-title>Universal transformers</chapter-title>. In: <source><italic>7th International Conference on Learning Representations, ICLR 2019</italic></source>, <conf-loc>New Orleans, LA, USA</conf-loc>, <conf-date>May 6–9, 2019</conf-date> (<string-name><given-names>T</given-names> <surname>Sainath</surname></string-name>, <string-name><given-names>A</given-names> <surname>Rush</surname></string-name>, <string-name><given-names>S</given-names> <surname>Levine</surname></string-name>, <string-name><given-names>K</given-names> <surname>Livescu</surname></string-name>, <string-name><given-names>S</given-names> <surname>Mohamed</surname></string-name>, eds.), <comment>OpenReview.net</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_006">
<mixed-citation publication-type="other"> <string-name><surname>Denil</surname> <given-names>M</given-names></string-name>, <string-name><surname>Demiraj</surname> <given-names>A</given-names></string-name>, <string-name><surname>De Freitas</surname> <given-names>N</given-names></string-name> (<year>2014</year>). Extraction of salient sentences from labelled documents. <italic>arXiv preprint:</italic> <uri>https://arxiv.org/abs/1412.6815</uri>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_007">
<mixed-citation publication-type="chapter"> <string-name><surname>Devlin</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Toutanova</surname> <given-names>K</given-names></string-name> (<year>2019</year>). <chapter-title>BERT: Pre-training of deep bidirectional transformers for language understanding</chapter-title>. In: <source><italic>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</italic></source>, <conf-loc>Minneapolis, MN, USA</conf-loc>, <conf-date>June 2–7, 2019</conf-date>, <comment>Volume 1 (Long and Short Papers)</comment> (<string-name><given-names>J</given-names> <surname>Burstein</surname></string-name>, <string-name><given-names>C</given-names> <surname>Doran</surname></string-name>, <string-name><given-names>T</given-names> <surname>Solorio</surname></string-name>, eds.), <fpage>4171</fpage>–<lpage>4186</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_008">
<mixed-citation publication-type="chapter"> <string-name><surname>Dong</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Cordonnier</surname> <given-names>J</given-names></string-name>, <string-name><surname>Loukas</surname> <given-names>A</given-names></string-name> (<year>2021</year>). <chapter-title>Attention is not all you need: Pure attention loses rank doubly exponentially with depth</chapter-title>. In: <source><italic>Proceedings of the 38th International Conference on Machine Learning, ICML 2021</italic></source>, <comment>Virtual Event</comment>, <conf-date>July 18–24, 2021</conf-date> (<string-name><given-names>M</given-names> <surname>Meila</surname></string-name>, <string-name><given-names>T</given-names> <surname>Zhang</surname></string-name>, eds.), volume <volume>139</volume> of <series><italic>Proceedings of Machine Learning Research</italic></series>, <fpage>2793</fpage>–<lpage>2803</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_009">
<mixed-citation publication-type="chapter"> <string-name><surname>Dosovitskiy</surname> <given-names>A</given-names></string-name>, <string-name><surname>Beyer</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kolesnikov</surname> <given-names>A</given-names></string-name>, <string-name><surname>Weissenborn</surname> <given-names>D</given-names></string-name>, <string-name><surname>Zhai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Unterthiner</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal> (<year>2021</year>). <chapter-title>An image is worth 16x16 words: Transformers for image recognition at scale</chapter-title>. In: <source><italic>9th International Conference on Learning Representations, ICLR 2021</italic></source>, <comment>Virtual Event</comment>, <conf-loc>Austria</conf-loc>, <conf-date>May 3–7, 2021</conf-date> (<string-name><given-names>S</given-names> <surname>Mohamed</surname></string-name>, <string-name><given-names>K</given-names> <surname>Hofmann</surname></string-name>, <string-name><given-names>A</given-names> <surname>Oh</surname></string-name>, <string-name><given-names>N</given-names> <surname>Murray</surname></string-name>, <string-name><given-names>I</given-names> <surname>Titov</surname></string-name>, eds.), <comment>OpenReview.net</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_010">
<mixed-citation publication-type="chapter"> <string-name><surname>Ganea</surname> <given-names>O</given-names></string-name>, <string-name><surname>Gelly</surname> <given-names>S</given-names></string-name>, <string-name><surname>Bécigneul</surname> <given-names>G</given-names></string-name>, <string-name><surname>Severyn</surname> <given-names>A</given-names></string-name> (<year>2019</year>). <chapter-title>Breaking the softmax bottleneck via learnable monotonic pointwise non-linearities</chapter-title>. In: <source><italic>Proceedings of the 36th International Conference on Machine Learning, ICML 2019</italic></source>, <conf-loc>Long Beach, California, USA</conf-loc>, <conf-date>June 9–15, 2019</conf-date> (<string-name><given-names>K</given-names> <surname>Chaudhuri</surname></string-name>, <string-name><given-names>R</given-names> <surname>Salakhutdinov</surname></string-name>, eds.), volume <volume>97</volume> of <series><italic>Proceedings of Machine Learning Research</italic></series>. <fpage>2073</fpage>–<lpage>2082</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_011">
<mixed-citation publication-type="chapter"> <string-name><surname>Jain</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wallace</surname> <given-names>BC</given-names></string-name> (<year>2019</year>). <chapter-title>Attention is not explanation</chapter-title>. In: <source><italic>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</italic></source>, <conf-loc>Minneapolis, MN, USA</conf-loc>, <conf-date>June 2–7, 2019</conf-date>, <comment>Volume 1 (Long and Short Papers)</comment> (<string-name><given-names>J</given-names> <surname>Burstein</surname></string-name>, <string-name><given-names>C</given-names> <surname>Doran</surname></string-name>, <string-name><given-names>T</given-names> <surname>Solorio</surname></string-name>, eds.), <fpage>3543</fpage>–<lpage>3556</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_012">
<mixed-citation publication-type="chapter"> <string-name><surname>Kanai</surname> <given-names>S</given-names></string-name>, <string-name><surname>Fujiwara</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yamanaka</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Adachi</surname> <given-names>S</given-names></string-name> (<year>2018</year>). <chapter-title>Sigsoftmax: Reanalysis of the softmax bottleneck</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018</italic></source>, <conf-loc>Montréal, Canada</conf-loc>, <conf-date>December 3–8, 2018</conf-date> (<string-name><given-names>S</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>HM</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>H</given-names> <surname>Larochelle</surname></string-name>, <string-name><given-names>K</given-names> <surname>Grauman</surname></string-name>, <string-name><given-names>N</given-names> <surname>Cesa-Bianchi</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), <fpage>284</fpage>–<lpage>294</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_013">
<mixed-citation publication-type="chapter"> <string-name><surname>Katharopoulos</surname> <given-names>A</given-names></string-name>, <string-name><surname>Vyas</surname> <given-names>A</given-names></string-name>, <string-name><surname>Pappas</surname> <given-names>N</given-names></string-name>, <string-name><surname>Fleuret</surname> <given-names>F</given-names></string-name> (<year>2020</year>). <chapter-title>Transformers are RNNs: Fast autoregressive transformers with linear attention</chapter-title>. In: <source><italic>Proceedings of the 37th International Conference on Machine Learning, ICML 2020</italic></source>, <comment>Virtual Event</comment>, <conf-date>July 13–18, 2020</conf-date> (<string-name><given-names>D</given-names> <surname>Blei</surname></string-name>, <string-name><given-names>H</given-names> <surname>Daume</surname></string-name>, <string-name><given-names>A</given-names> <surname>Singh</surname></string-name>, eds.), volume <volume>119</volume> of <series><italic>Proceedings of Machine Learning Research</italic></series>, <fpage>5156</fpage>–<lpage>5165</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_014">
<mixed-citation publication-type="chapter"> <string-name><surname>Kingma</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Ba</surname> <given-names>J</given-names></string-name> (<year>2015</year>). <chapter-title>Adam: A method for stochastic optimization</chapter-title>. In: <source><italic>3rd International Conference on Learning Representations, ICLR 2015</italic></source>, <conf-loc>San Diego, CA, USA</conf-loc>, <conf-date>May 7–9, 2015</conf-date> (<string-name><given-names>Y</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>Y</given-names> <surname>LeCun</surname></string-name>, eds.), <comment><italic>Conference Track Proceedings</italic></comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_015">
<mixed-citation publication-type="chapter"> <string-name><surname>Kitaev</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kaiser</surname> <given-names>L</given-names></string-name>, <string-name><surname>Levskaya</surname> <given-names>A</given-names></string-name> (<year>2020</year>). <chapter-title>Reformer: The efficient transformer</chapter-title>. In: <source><italic>8th International Conference on Learning Representations, ICLR 2020</italic></source>, <conf-loc>Addis Ababa, Ethiopia</conf-loc>, <conf-date>April 26–30, 2020</conf-date> (<string-name><given-names>A</given-names> <surname>Rush</surname></string-name>, <string-name><given-names>S</given-names> <surname>Mohamed</surname></string-name>, <string-name><given-names>D</given-names> <surname>Song</surname></string-name>, <string-name><given-names>K</given-names> <surname>Cho</surname></string-name>, <string-name><given-names>M</given-names> <surname>White</surname></string-name>, eds.), <comment>OpenReview.net</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_016">
<mixed-citation publication-type="chapter"> <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>X</given-names></string-name>, <string-name><surname>Hovy</surname> <given-names>EH</given-names></string-name>, <string-name><surname>Jurafsky</surname> <given-names>D</given-names></string-name> (<year>2016</year>a). <chapter-title>Visualizing and understanding neural models in NLP</chapter-title>. In: <source><italic>NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</italic></source>, <conf-loc>San Diego, California, USA</conf-loc>, <conf-date>June 12–17, 2016</conf-date> (<string-name><given-names>K</given-names> <surname>Knight</surname></string-name>, <string-name><given-names>A</given-names> <surname>Nenkova</surname></string-name>, <string-name><given-names>O</given-names> <surname>Rambow</surname></string-name>, eds.), <fpage>681</fpage>–<lpage>691</lpage>. <publisher-name>The Association for Computational Linguistics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_017">
<mixed-citation publication-type="other"> <string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Monroe</surname> <given-names>W</given-names></string-name>, <string-name><surname>Jurafsky</surname> <given-names>D</given-names></string-name> (<year>2016</year>b). Understanding neural networks through representation erasure. <italic>CoRR</italic>, abs/1612.08220.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_018">
<mixed-citation publication-type="other"> <string-name><surname>Lin</surname> <given-names>Y</given-names></string-name> (<year>2021</year>). Breaking the softmax bottleneck for sequential recommender systems with dropout and decoupling. <italic>CoRR</italic>, abs/2110.05409.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_019">
<mixed-citation publication-type="chapter"> <string-name><surname>Martins</surname> <given-names>AFT</given-names></string-name>, <string-name><surname>Astudillo</surname> <given-names>RF</given-names></string-name> (<year>2016</year>). <chapter-title>From softmax to sparsemax: A sparse model of attention and multi-label classification</chapter-title>. In: <source><italic>Proceedings of the 33rd International Conference on Machine Learning, ICML 2016</italic></source>, <conf-loc>New York City, NY, USA</conf-loc>, <conf-date>June 19–24, 2016</conf-date> (<string-name><given-names>M</given-names> <surname>Balcan</surname></string-name>, <string-name><given-names>KQ</given-names> <surname>Weinberger</surname></string-name>, eds.), volume <volume>48</volume> of <series><italic>JMLR Workshop and Conference Proceedings</italic></series>. <fpage>1614</fpage>–<lpage>1623</lpage>. <comment>JMLR.org</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_020">
<mixed-citation publication-type="chapter"> <string-name><surname>Peng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Pappas</surname> <given-names>N</given-names></string-name>, <string-name><surname>Yogatama</surname> <given-names>D</given-names></string-name>, <string-name><surname>Schwartz</surname> <given-names>R</given-names></string-name>, <string-name><surname>Smith</surname> <given-names>NA</given-names></string-name>, <string-name><surname>Kong</surname> <given-names>L</given-names></string-name> (<year>2021</year>). <chapter-title>Random feature attention</chapter-title>. In: <source><italic>9th International Conference on Learning Representations, ICLR 2021</italic></source>, <comment>Virtual Event</comment>, <conf-loc>Austria</conf-loc>, <conf-date>May 3–7, 2021</conf-date> (<string-name><given-names>S</given-names> <surname>Mohamed</surname></string-name>, <string-name><given-names>K</given-names> <surname>Hofmann</surname></string-name>, <string-name><given-names>A</given-names> <surname>Oh</surname></string-name>, <string-name><given-names>N</given-names> <surname>Murray</surname></string-name>, <string-name><given-names>I</given-names> <surname>Titov</surname></string-name>, eds.), <comment>OpenReview.net</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_021">
<mixed-citation publication-type="chapter"> <string-name><surname>Ribeiro</surname> <given-names>MT</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Guestrin</surname> <given-names>C</given-names></string-name> (<year>2016</year>a). <chapter-title>“Why should I trust you?”: Explaining the predictions of any classifier</chapter-title>. In: <source><italic>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</italic></source>, <conf-loc>San Francisco, CA, USA</conf-loc>, <conf-date>August 13–17, 2016</conf-date> (<string-name><given-names>B</given-names> <surname>Krishnapuram</surname></string-name>, <string-name><given-names>M</given-names> <surname>Shah</surname></string-name>, <string-name><given-names>AJ</given-names> <surname>Smola</surname></string-name>, <string-name><given-names>CC</given-names> <surname>Aggarwal</surname></string-name>, <string-name><given-names>D</given-names> <surname>Shen</surname></string-name>, <string-name><given-names>R</given-names> <surname>Rastogi</surname></string-name>, eds.), <fpage>1135</fpage>–<lpage>1144</lpage>. <publisher-name>ACM</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_022">
<mixed-citation publication-type="chapter"> <string-name><surname>Ribeiro</surname> <given-names>MT</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>S</given-names></string-name>, <string-name><surname>Guestrin</surname> <given-names>C</given-names></string-name> (<year>2016</year>b). <chapter-title>“Why should I trust you?”: Explaining the predictions of any classifier</chapter-title>. In: <source><italic>Proceedings of the Demonstrations Session, NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</italic></source>, <conf-loc>San Diego, California, USA</conf-loc>, <conf-date>June 12–17, 2016</conf-date> (<string-name><given-names>K</given-names> <surname>Knight</surname></string-name>, <string-name><given-names>A</given-names> <surname>Nenkova</surname></string-name>, <string-name><given-names>O</given-names> <surname>Rambow</surname></string-name>, eds.), <fpage>97</fpage>–<lpage>101</lpage>. <publisher-name>The Association for Computational Linguistics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_023">
<mixed-citation publication-type="chapter"> <string-name><surname>Robnik-Sikonja</surname> <given-names>M</given-names></string-name>, <string-name><surname>Bohanec</surname> <given-names>M</given-names></string-name> (<year>2018</year>). <chapter-title>Perturbation-based explanations of prediction models</chapter-title>. In: <source><italic>Human and Machine Learning - Visible, Explainable, Trustworthy and Transparent</italic></source> (<string-name><given-names>J</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>F</given-names> <surname>Chen</surname></string-name>, eds.), <series><italic>Human-Computer Interaction Series</italic></series>, <fpage>159</fpage>–<lpage>175</lpage>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_024">
<mixed-citation publication-type="book"> <string-name><surname>Samek</surname> <given-names>W</given-names></string-name>, <string-name><surname>Montavon</surname> <given-names>G</given-names></string-name>, <string-name><surname>Vedaldi</surname> <given-names>A</given-names></string-name>, <string-name><surname>Hansen</surname> <given-names>LK</given-names></string-name>, <string-name><surname>Müller</surname> <given-names>K</given-names></string-name> (Eds.) (<year>2019</year>). <source><italic>Explainable AI: Interpreting, Explaining and Visualizing Deep Learning</italic></source>, volume <volume>11700</volume> of <series><italic>Lecture Notes in Computer Science</italic></series>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_025">
<mixed-citation publication-type="chapter"> <string-name><surname>Selvaraju</surname> <given-names>RR</given-names></string-name>, <string-name><surname>Cogswell</surname> <given-names>M</given-names></string-name>, <string-name><surname>Das</surname> <given-names>A</given-names></string-name>, <string-name><surname>Vedantam</surname> <given-names>R</given-names></string-name>, <string-name><surname>Parikh</surname> <given-names>D</given-names></string-name>, <string-name><surname>Batra</surname> <given-names>D</given-names></string-name> (<year>2017</year>). <chapter-title>Grad-CAM: Visual explanations from deep networks via gradient-based localization</chapter-title>. In: <source><italic>Proceedings of the IEEE International Conference on Computer Vision</italic></source> (<string-name><given-names>K</given-names> <surname>Ikeuchi</surname></string-name>, <string-name><given-names>G</given-names> <surname>Medioni</surname></string-name>, <string-name><given-names>M</given-names> <surname>Pelillo</surname></string-name>, eds.), <fpage>618</fpage>–<lpage>626</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_026">
<mixed-citation publication-type="chapter"> <string-name><surname>Serrano</surname> <given-names>S</given-names></string-name>, <string-name><surname>Smith</surname> <given-names>NA</given-names></string-name> (<year>2019</year>). <chapter-title>Is attention interpretable?</chapter-title> In: <source><italic>Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019</italic></source>, <conf-loc>Florence, Italy</conf-loc>, <conf-date>July 28–August 2, 2019</conf-date>, <comment>Volume 1: Long Papers</comment> (<string-name><given-names>A</given-names> <surname>Korhonen</surname></string-name>, <string-name><given-names>DR</given-names> <surname>Traum</surname></string-name>, <string-name><given-names>L</given-names> <surname>Màrquez</surname></string-name>, eds.), <fpage>2931</fpage>–<lpage>2951</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_027">
<mixed-citation publication-type="chapter"> <string-name><surname>Shim</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>M</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>I</given-names></string-name>, <string-name><surname>Boo</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sung</surname> <given-names>W</given-names></string-name> (<year>2017</year>). <chapter-title>SVD-softmax: Fast softmax approximation on large vocabulary neural networks</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</italic></source>, <conf-loc>Long Beach, CA, USA</conf-loc>, <conf-date>December 4–9, 2017</conf-date> (<string-name><given-names>I</given-names> <surname>Guyon</surname></string-name>, <string-name><given-names>U</given-names> <surname>von Luxburg</surname></string-name>, <string-name><given-names>S</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>HM</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>R</given-names> <surname>Fergus</surname></string-name>, <string-name><given-names>SVN</given-names> <surname>Vishwanathan</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), <fpage>5463</fpage>–<lpage>5473</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_028">
<mixed-citation publication-type="journal"> <string-name><surname>Srivastava</surname> <given-names>N</given-names></string-name>, <string-name><surname>Hinton</surname> <given-names>G</given-names></string-name>, <string-name><surname>Krizhevsky</surname> <given-names>A</given-names></string-name>, <string-name><surname>Sutskever</surname> <given-names>I</given-names></string-name>, <string-name><surname>Salakhutdinov</surname> <given-names>R</given-names></string-name> (<year>2014</year>). <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>15</volume>(<issue>1</issue>): <fpage>1929</fpage>–<lpage>1958</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_029">
<mixed-citation publication-type="chapter"> <string-name><surname>Sun</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lu</surname> <given-names>W</given-names></string-name> (<year>2020</year>). <chapter-title>Understanding attention for text classification</chapter-title>. In: <source><italic>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</italic></source>, <comment>Online</comment>, <conf-date>July 5–10, 2020</conf-date> (<string-name><given-names>D</given-names> <surname>Jurafsky</surname></string-name>, <string-name><given-names>J</given-names> <surname>Chai</surname></string-name>, <string-name><given-names>N</given-names> <surname>Schluter</surname></string-name>, <string-name><given-names>JR</given-names> <surname>Tetreault</surname></string-name>, eds.), <fpage>3418</fpage>–<lpage>3428</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_030">
<mixed-citation publication-type="chapter"> <string-name><surname>Titsias</surname> <given-names>MK</given-names></string-name> (<year>2016</year>). <chapter-title>One-vs-each approximation to softmax for scalable estimation of probabilities</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016</italic></source>, <conf-loc>Barcelona, Spain</conf-loc>, <conf-date>December 5–10, 2016</conf-date> (<string-name><given-names>DD</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>M</given-names> <surname>Sugiyama</surname></string-name>, <string-name><given-names>U</given-names> <surname>von Luxburg</surname></string-name>, <string-name><given-names>I</given-names> <surname>Guyon</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), <fpage>4161</fpage>–<lpage>4169</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_031">
<mixed-citation publication-type="other"> <string-name><surname>Touvron</surname> <given-names>H</given-names></string-name>, <string-name><surname>Lavril</surname> <given-names>T</given-names></string-name>, <string-name><surname>Izacard</surname> <given-names>G</given-names></string-name>, <string-name><surname>Martinet</surname> <given-names>X</given-names></string-name>, <string-name><surname>Lachaux</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Lacroix</surname> <given-names>T</given-names></string-name>, <etal>et al.</etal> (<year>2023</year>). LLaMA: Open and efficient foundation language models. <italic>arXiv preprint:</italic> <uri>https://arxiv.org/abs/2302.13971</uri>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_032">
<mixed-citation publication-type="other"> <string-name><surname>Vashishth</surname> <given-names>S</given-names></string-name>, <string-name><surname>Upadhyay</surname> <given-names>S</given-names></string-name>, <string-name><surname>Tomar</surname> <given-names>GS</given-names></string-name>, <string-name><surname>Faruqui</surname> <given-names>M</given-names></string-name> (<year>2019</year>). Attention interpretability across NLP tasks. <italic>CoRR</italic>, abs/1909.11218.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_033">
<mixed-citation publication-type="chapter"> <string-name><surname>Vaswani</surname> <given-names>A</given-names></string-name>, <string-name><surname>Shazeer</surname> <given-names>N</given-names></string-name>, <string-name><surname>Parmar</surname> <given-names>N</given-names></string-name>, <string-name><surname>Uszkoreit</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>L</given-names></string-name>, <string-name><surname>Gomez</surname> <given-names>AN</given-names></string-name>, <etal>et al.</etal> (<year>2017</year>). <chapter-title>Attention is all you need</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</italic></source>, <conf-loc>Long Beach, CA, USA</conf-loc>, <conf-date>December 4–9, 2017</conf-date> (<string-name><given-names>I</given-names> <surname>Guyon</surname></string-name>, <string-name><given-names>U</given-names> <surname>von Luxburg</surname></string-name>, <string-name><given-names>S</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>HM</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>R</given-names> <surname>Fergus</surname></string-name>, <string-name><given-names>SVN</given-names> <surname>Vishwanathan</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), <fpage>5998</fpage>–<lpage>6008</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_034">
<mixed-citation publication-type="other"> <string-name><surname>Wang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>BZ</given-names></string-name>, <string-name><surname>Khabsa</surname> <given-names>M</given-names></string-name>, <string-name><surname>Fang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>H</given-names></string-name> (<year>2020</year>). Linformer: Self-attention with linear complexity. <italic>CoRR</italic>, abs/2006.04768.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_035">
<mixed-citation publication-type="chapter"> <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Girshick</surname> <given-names>RB</given-names></string-name>, <string-name><surname>Gupta</surname> <given-names>A</given-names></string-name>, <string-name><surname>He</surname> <given-names>K</given-names></string-name> (<year>2018</year>). <chapter-title>Non-local neural networks</chapter-title>. In: <source><italic>2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018</italic></source>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, <conf-date>June 18–22, 2018</conf-date> (<string-name><given-names>MS</given-names> <surname>Brown</surname></string-name>, <string-name><given-names>B</given-names> <surname>Morse</surname></string-name>, <string-name><given-names>S</given-names> <surname>Peleg</surname></string-name>, eds.), <fpage>7794</fpage>–<lpage>7803</lpage>. <publisher-name>Computer Vision Foundation / IEEE Computer Society</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_036">
<mixed-citation publication-type="chapter"> <string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Dai</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Salakhutdinov</surname> <given-names>R</given-names></string-name>, <string-name><surname>Cohen</surname> <given-names>WW</given-names></string-name> (<year>2018</year>). <chapter-title>Breaking the softmax bottleneck: A high-rank RNN language model</chapter-title>. In: <source><italic>6th International Conference on Learning Representations, ICLR 2018</italic></source>, <conf-loc>Vancouver, BC, Canada</conf-loc>, <conf-date>April 30–May 3, 2018</conf-date> (<string-name><given-names>Y</given-names> <surname>Bengio</surname></string-name>, <string-name><given-names>Y</given-names> <surname>LeCun</surname></string-name>, <string-name><given-names>T</given-names> <surname>Sainath</surname></string-name>, eds.), <comment>Conference Track Proceedings. OpenReview.net</comment>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_037">
<mixed-citation publication-type="chapter"> <string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Luong</surname> <given-names>T</given-names></string-name>, <string-name><surname>Salakhutdinov</surname> <given-names>R</given-names></string-name>, <string-name><surname>Le</surname> <given-names>QV</given-names></string-name> (<year>2019</year>). <chapter-title>Mixtape: Breaking the softmax bottleneck efficiently</chapter-title>. In: <source><italic>Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</italic></source>, <conf-loc>Vancouver, BC, Canada</conf-loc>, <conf-date>December 8–14, 2019</conf-date> (<string-name><given-names>HM</given-names> <surname>Wallach</surname></string-name>, <string-name><given-names>H</given-names> <surname>Larochelle</surname></string-name>, <string-name><given-names>A</given-names> <surname>Beygelzimer</surname></string-name>, <string-name><given-names>F</given-names> <surname>d’Alché-Buc</surname></string-name>, <string-name><given-names>EB</given-names> <surname>Fox</surname></string-name>, <string-name><given-names>R</given-names> <surname>Garnett</surname></string-name>, eds.), <fpage>15922</fpage>–<lpage>15930</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1134_ref_038">
<mixed-citation publication-type="chapter"> <string-name><surname>Zhen</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>W</given-names></string-name>, <string-name><surname>Deng</surname> <given-names>H</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wei</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Lv</surname> <given-names>B</given-names></string-name>, <etal>et al.</etal> (<year>2022</year>). <chapter-title>cosFormer: Rethinking softmax in attention</chapter-title>. In: <source><italic>International Conference on Learning Representations</italic></source> (<string-name><given-names>K</given-names> <surname>Hofmann</surname></string-name>, <string-name><given-names>A</given-names> <surname>Rush</surname></string-name>, <string-name><given-names>Y</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>C</given-names> <surname>Finn</surname></string-name>, <string-name><given-names>Y</given-names> <surname>Choi</surname></string-name>, <string-name><given-names>M</given-names> <surname>Deisenroth</surname></string-name>, eds.).</mixed-citation>
</ref>
</ref-list>
</back>
</article>
