<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1201</article-id>
<article-id pub-id-type="doi">10.6339/25-JDS1201</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Statistical Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Reward Collapse in Aligning Large Language Models</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Song</surname><given-names>Ziang</given-names></name><xref ref-type="aff" rid="j_jds1201_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Cai</surname><given-names>Tianle</given-names></name><xref ref-type="aff" rid="j_jds1201_aff_002">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Lee</surname><given-names>Jason D.</given-names></name><xref ref-type="aff" rid="j_jds1201_aff_002">2</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Su</surname><given-names>Weijie J.</given-names></name><email xlink:href="mailto:suw@wharton.upenn.edu">suw@wharton.upenn.edu</email><xref ref-type="aff" rid="j_jds1201_aff_003">3</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1201_aff_001"><label>1</label><institution>Stanford University</institution>, <country>USA</country></aff>
<aff id="j_jds1201_aff_002"><label>2</label><institution>Princeton University</institution>, <country>USA</country></aff>
<aff id="j_jds1201_aff_003"><label>3</label><institution>University of Pennsylvania</institution>, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:suw@wharton.upenn.edu">suw@wharton.upenn.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2026</year></pub-date><pub-date pub-type="epub"><day>21</day><month>10</month><year>2025</year></pub-date><volume>24</volume><issue>1</issue><fpage>146</fpage><lpage>166</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds1201_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>The supplementary material contains detailed proofs for all theoretical results in the paper, more details of the experimental setup, and the accompanying source code of our experiments.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>26</day><month>11</month><year>2024</year></date><date date-type="accepted"><day>5</day><month>10</month><year>2025</year></date></history>
<permissions><copyright-statement>2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences represented as rankings of responses to prompts. In this paper, we document the phenomenon of <italic>reward collapse</italic>, an empirical observation in which the prevailing ranking-based approach yields an <italic>identical</italic> reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable because open-ended prompts like “write a short story about your best friend” should yield a continuous range of rewards for their completions, while specific prompts like “what is the capital city of New Zealand” should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. We then derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. Based on the reward distributions for different utility functions, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>alignment</kwd>
<kwd>human feedback</kwd>
<kwd>optimization</kwd>
<kwd>reward model</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1201_reflist_001">
<title>References</title>
<ref id="j_jds1201_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Amore</surname> <given-names>P</given-names></string-name>, <string-name><surname>Jacobo</surname> <given-names>M</given-names></string-name> (<year>2019</year>). <article-title>Thomson problem in one dimension: Minimal energy configurations of n charges on a curve</article-title>. <source><italic>Physica A. Statistical Mechanics and Its Applications</italic></source>, <volume>519</volume>: <fpage>256</fpage>–<lpage>266</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.physa.2018.12.040" xlink:type="simple">https://doi.org/10.1016/j.physa.2018.12.040</ext-link></mixed-citation>
</ref>
<ref id="j_jds1201_ref_002">
<mixed-citation publication-type="other"> <string-name><surname>Bahdanau</surname> <given-names>D</given-names></string-name>, <string-name><surname>Hill</surname> <given-names>F</given-names></string-name>, <string-name><surname>Leike</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hughes</surname> <given-names>E</given-names></string-name>, <string-name><surname>Hosseini</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kohli</surname> <given-names>P</given-names></string-name>, et al. (<year>2018</year>). Learning to understand goal specifications by modelling reward. arXiv preprint: <uri>https://arxiv.org/abs/1806.01946</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_003">
<mixed-citation publication-type="other"> <string-name><surname>Bai</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ndousse</surname> <given-names>K</given-names></string-name>, <string-name><surname>Askell</surname> <given-names>A</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>A</given-names></string-name>, <string-name><surname>DasSarma</surname> <given-names>N</given-names></string-name>, et al. (<year>2022</year>a). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint: <uri>https://arxiv.org/abs/2204.05862</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_004">
<mixed-citation publication-type="other"> <string-name><surname>Bai</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kadavath</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kundu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Askell</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kernion</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jones</surname> <given-names>A</given-names></string-name>, et al. (<year>2022</year>b). Constitutional AI: Harmlessness from AI feedback. arXiv preprint: <uri>https://arxiv.org/abs/2212.08073</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_005">
<mixed-citation publication-type="other"> <string-name><surname>Beeching</surname> <given-names>E</given-names></string-name>, <string-name><surname>Belkada</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Rasul</surname> <given-names>K</given-names></string-name>, <string-name><surname>Tunstall</surname> <given-names>L</given-names></string-name>, <string-name><surname>von Werra</surname> <given-names>L</given-names></string-name>, <string-name><surname>Rajani</surname> <given-names>N</given-names></string-name>, et al. (<year>2023</year>). StackLLaMA: An RL Finetuned LLaMA Model for Stack Exchange Question and Answering. See <ext-link ext-link-type="uri" xlink:href="https://huggingface.co/blog/stackllama">https://huggingface.co/blog/stackllama</ext-link> (accessed 14 April 2023).</mixed-citation>
</ref>
<ref id="j_jds1201_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Black</surname> <given-names>S</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <string-name><surname>Leahy</surname> <given-names>C</given-names></string-name>, <string-name><surname>Biderman</surname> <given-names>S</given-names></string-name> (<year>2021</year>). <article-title>GPT-Neo: Large-scale autoregressive language modeling with Mesh-Tensorflow</article-title>. <source><italic>Zenodo</italic></source>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.5281/zenodo.5297715" xlink:type="simple">https://doi.org/10.5281/zenodo.5297715</ext-link></mixed-citation>
</ref>
<ref id="j_jds1201_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Bowick</surname> <given-names>M</given-names></string-name>, <string-name><surname>Cacciuto</surname> <given-names>A</given-names></string-name>, <string-name><surname>Nelson</surname> <given-names>DR</given-names></string-name>, <string-name><surname>Travesset</surname> <given-names>A</given-names></string-name> (<year>2002</year>). <article-title>Crystalline order on a sphere and the generalized Thomson problem</article-title>. <source><italic>Physical Review Letters</italic></source>, <volume>89</volume>(<issue>18</issue>): <elocation-id>185502</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1103/PhysRevLett.89.185502" xlink:type="simple">https://doi.org/10.1103/PhysRevLett.89.185502</ext-link></mixed-citation>
</ref>
<ref id="j_jds1201_ref_008">
<mixed-citation publication-type="book"> <string-name><surname>Boyd</surname> <given-names>SP</given-names></string-name>, <string-name><surname>Vandenberghe</surname> <given-names>L</given-names></string-name> (<year>2004</year>). <source><italic>Convex Optimization</italic></source>. <publisher-name>Cambridge University Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_009">
<mixed-citation publication-type="other"> <string-name><surname>Christiano</surname> <given-names>PF</given-names></string-name>, <string-name><surname>Leike</surname> <given-names>J</given-names></string-name>, <string-name><surname>Brown</surname> <given-names>T</given-names></string-name>, <string-name><surname>Martic</surname> <given-names>M</given-names></string-name>, <string-name><surname>Legg</surname> <given-names>S</given-names></string-name>, <string-name><surname>Amodei</surname> <given-names>D</given-names></string-name> (<year>2017</year>). Deep reinforcement learning from human preferences. <italic>Advances in Neural Information Processing Systems</italic>, 30.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_010">
<mixed-citation publication-type="other"> <string-name><surname>Desai</surname> <given-names>S</given-names></string-name>, <string-name><surname>Durrett</surname> <given-names>G</given-names></string-name> (<year>2020</year>). Calibration of pre-trained transformers. arXiv preprint: <uri>https://arxiv.org/abs/2003.07892</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_011">
<mixed-citation publication-type="chapter"> <string-name><surname>Ethayarajh</surname> <given-names>K</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Swayamdipta</surname> <given-names>S</given-names></string-name> (<year>2022</year>). <chapter-title>Understanding dataset difficulty with V-usable information</chapter-title>. In: <string-name><surname>Chaudhuri</surname> <given-names>K.</given-names></string-name>, <string-name><surname>Jegelka</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Song</surname> <given-names>L.</given-names></string-name>, <string-name><surname>Szepesvári</surname> <given-names>C.</given-names></string-name>, <string-name><surname>Niu</surname> <given-names>G.</given-names></string-name>, <string-name><surname>Sabato</surname> <given-names>S.</given-names></string-name> (eds.) <source><italic>Proceedings of the 39th International Conference on Machine Learning</italic></source>, <fpage>5988</fpage>–<lpage>6008</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_012">
<mixed-citation publication-type="other"> <string-name><surname>Ganguli</surname> <given-names>D</given-names></string-name>, <string-name><surname>Lovitt</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kernion</surname> <given-names>J</given-names></string-name>, <string-name><surname>Askell</surname> <given-names>A</given-names></string-name>, <string-name><surname>Bai</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Kadavath</surname> <given-names>S</given-names></string-name>, et al. (<year>2022</year>). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint: <uri>https://arxiv.org/abs/2209.07858</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_013">
<mixed-citation publication-type="other"> <string-name><surname>Gu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Tan</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhai</surname> <given-names>X</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>C</given-names></string-name>, et al. (<year>2024</year>). A survey on LLM-as-a-judge. arXiv preprint: <uri>https://arxiv.org/abs/2411.15594</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_014">
<mixed-citation publication-type="chapter"> <string-name><surname>Guo</surname> <given-names>C</given-names></string-name>, <string-name><surname>Pleiss</surname> <given-names>G</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Weinberger</surname> <given-names>KQ</given-names></string-name> (<year>2017</year>). <chapter-title>On calibration of modern neural networks</chapter-title>. In: <string-name><surname>Precup</surname> <given-names>D.</given-names></string-name>, <string-name><surname>Teh</surname> <given-names>Y. W.</given-names></string-name> (eds.) <source><italic>Proceedings of the 34th International Conference on Machine Learning</italic></source>, <fpage>1321</fpage>–<lpage>1330</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Hardin</surname> <given-names>DP</given-names></string-name>, <string-name><surname>Saff</surname> <given-names>EB</given-names></string-name>, <etal>et al.</etal> (<year>2004</year>). <article-title>Discretizing manifolds via minimum energy points</article-title>. <source><italic>Notices of the American Mathematical Society</italic></source>, <volume>51</volume>(<issue>10</issue>): <fpage>1186</fpage>–<lpage>1194</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_016">
<mixed-citation publication-type="other"> <string-name><surname>He</surname> <given-names>P</given-names></string-name>, <string-name><surname>Gao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>W</given-names></string-name> (<year>2021</year>). DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint: <uri>https://arxiv.org/abs/2111.09543</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_017">
<mixed-citation publication-type="other"> <string-name><surname>Ibarz</surname> <given-names>B</given-names></string-name>, <string-name><surname>Leike</surname> <given-names>J</given-names></string-name>, <string-name><surname>Pohlen</surname> <given-names>T</given-names></string-name>, <string-name><surname>Irving</surname> <given-names>G</given-names></string-name>, <string-name><surname>Legg</surname> <given-names>S</given-names></string-name>, <string-name><surname>Amodei</surname> <given-names>D</given-names></string-name> (<year>2018</year>). Reward learning from human preferences and demonstrations in Atari. <italic>Advances in Neural Information Processing Systems</italic>, 31.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_018">
<mixed-citation publication-type="other"> <string-name><surname>Kadavath</surname> <given-names>S</given-names></string-name>, <string-name><surname>Conerly</surname> <given-names>T</given-names></string-name>, <string-name><surname>Askell</surname> <given-names>A</given-names></string-name>, <string-name><surname>Henighan</surname> <given-names>T</given-names></string-name>, <string-name><surname>Drain</surname> <given-names>D</given-names></string-name>, <string-name><surname>Perez</surname> <given-names>E</given-names></string-name>, et al. (<year>2022</year>). Language models (mostly) know what they know. arXiv preprint: <uri>https://arxiv.org/abs/2207.05221</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_019">
<mixed-citation publication-type="other"> <string-name><surname>Köksal</surname> <given-names>A</given-names></string-name>, <string-name><surname>Schick</surname> <given-names>T</given-names></string-name>, <string-name><surname>Korhonen</surname> <given-names>A</given-names></string-name>, <string-name><surname>Schütze</surname> <given-names>H</given-names></string-name> (<year>2023</year>). LongForm: Optimizing instruction tuning for long text generation with corpus extraction. arXiv preprint: <uri>https://arxiv.org/abs/2304.08460</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_020">
<mixed-citation publication-type="other"> <string-name><surname>Lambert</surname> <given-names>N</given-names></string-name>, <string-name><surname>Tunstall</surname> <given-names>L</given-names></string-name>, <string-name><surname>Rajani</surname> <given-names>N</given-names></string-name>, <string-name><surname>Thrush</surname> <given-names>T</given-names></string-name> (<year>2023</year>). HuggingFace H4 Stack Exchange preference dataset. URL: <ext-link ext-link-type="uri" xlink:href="https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences">https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_021">
<mixed-citation publication-type="book"> <string-name><surname>Landkof</surname> <given-names>NS</given-names></string-name> (<year>1972</year>). <source><italic>Foundations of Modern Potential Theory</italic></source>, volume <volume>180</volume>. <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_022">
<mixed-citation publication-type="other"> <string-name><surname>Lin</surname> <given-names>S</given-names></string-name>, <string-name><surname>Hilton</surname> <given-names>J</given-names></string-name>, <string-name><surname>Evans</surname> <given-names>O</given-names></string-name> (<year>2022</year>). Teaching models to express their uncertainty in words. arXiv preprint: <uri>https://arxiv.org/abs/2205.14334</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_023">
<mixed-citation publication-type="other"> <string-name><surname>Liu</surname> <given-names>H</given-names></string-name>, <string-name><surname>Sferrazza</surname> <given-names>C</given-names></string-name>, <string-name><surname>Abbeel</surname> <given-names>P</given-names></string-name> (<year>2023</year>). Chain of hindsight aligns language models with feedback. arXiv preprint: <uri>https://arxiv.org/abs/2302.02676</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_024">
<mixed-citation publication-type="journal"> <string-name><surname>Martinez-Finkelshtein</surname> <given-names>A</given-names></string-name>, <string-name><surname>Maymeskul</surname> <given-names>V</given-names></string-name>, <string-name><surname>Rakhmanov</surname> <given-names>E</given-names></string-name>, <string-name><surname>Saff</surname> <given-names>E</given-names></string-name> (<year>2004</year>). <article-title>Asymptotics for minimal discrete Riesz energy on curves in <inline-formula id="j_jds1201_ineq_001"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">d</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathbb{R}^{d}}$]]></tex-math></alternatives></inline-formula></article-title>. <source><italic>Canadian Journal of Mathematics</italic></source>, <volume>56</volume>(<issue>3</issue>): <fpage>529</fpage>–<lpage>552</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.4153/CJM-2004-024-1" xlink:type="simple">https://doi.org/10.4153/CJM-2004-024-1</ext-link></mixed-citation>
</ref>
<ref id="j_jds1201_ref_025">
<mixed-citation publication-type="other"> <string-name><surname>Nakano</surname> <given-names>R</given-names></string-name>, <string-name><surname>Hilton</surname> <given-names>J</given-names></string-name>, <string-name><surname>Balaji</surname> <given-names>S</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ouyang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>C</given-names></string-name>, et al. (<year>2021</year>). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint: <uri>https://arxiv.org/abs/2112.09332</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_026">
<mixed-citation publication-type="other"> <collab>OpenAI</collab> (<year>2023</year>). GPT-4 technical report. arXiv preprint: <uri>https://arxiv.org/abs/2303.08774</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_027">
<mixed-citation publication-type="journal"> <string-name><surname>Ouyang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Almeida</surname> <given-names>D</given-names></string-name>, <string-name><surname>Wainwright</surname> <given-names>C</given-names></string-name>, <string-name><surname>Mishkin</surname> <given-names>P</given-names></string-name>, <etal>et al.</etal> (<year>2022</year>). <article-title>Training language models to follow instructions with human feedback</article-title>. <source><italic>Advances in Neural Information Processing Systems</italic></source>, <volume>35</volume>: <fpage>27730</fpage>–<lpage>27744</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_028">
<mixed-citation publication-type="other"> <string-name><surname>Padmakumar</surname> <given-names>V</given-names></string-name>, <string-name><surname>He</surname> <given-names>H</given-names></string-name> (<year>2023</year>). Does writing with language models reduce content diversity? arXiv preprint: <uri>https://arxiv.org/abs/2309.05196</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_029">
<mixed-citation publication-type="journal"> <string-name><surname>Papyan</surname> <given-names>V</given-names></string-name>, <string-name><surname>Han</surname> <given-names>X</given-names></string-name>, <string-name><surname>Donoho</surname> <given-names>DL</given-names></string-name> (<year>2020</year>). <article-title>Prevalence of neural collapse during the terminal phase of deep learning training</article-title>. <source><italic>Proceedings of the National Academy of Sciences of the United States of America</italic></source>, <volume>117</volume>(<issue>40</issue>): <fpage>24652</fpage>–<lpage>24663</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1073/pnas.2015509117" xlink:type="simple">https://doi.org/10.1073/pnas.2015509117</ext-link></mixed-citation>
</ref>
<ref id="j_jds1201_ref_030">
<mixed-citation publication-type="other"> <string-name><surname>Rafailov</surname> <given-names>R</given-names></string-name>, <string-name><surname>Sharma</surname> <given-names>A</given-names></string-name>, <string-name><surname>Mitchell</surname> <given-names>E</given-names></string-name>, <string-name><surname>Manning</surname> <given-names>CD</given-names></string-name>, <string-name><surname>Ermon</surname> <given-names>S</given-names></string-name>, <string-name><surname>Finn</surname> <given-names>C</given-names></string-name> (<year>2024</year>). Direct preference optimization: Your language model is secretly a reward model. <italic>Advances in Neural Information Processing Systems</italic>, 36.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_031">
<mixed-citation publication-type="other"> <string-name><surname>Sangchul Lee</surname></string-name> (<year>2017</year>). Expected absolute difference between two iid variables. Mathematics Stack Exchange. URL: <ext-link ext-link-type="uri" xlink:href="https://math.stackexchange.com/q/2542224">https://math.stackexchange.com/q/2542224</ext-link>. (version: 2017-11-29).</mixed-citation>
</ref>
<ref id="j_jds1201_ref_032">
<mixed-citation publication-type="other"> <string-name><surname>Sun</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Shen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Cox</surname> <given-names>D</given-names></string-name>, et al. (<year>2023</year>). Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint: <uri>https://arxiv.org/abs/2305.03047</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_033">
<mixed-citation publication-type="other"> <string-name><surname>Zhu</surname> <given-names>B</given-names></string-name>, <string-name><surname>Jiao</surname> <given-names>J</given-names></string-name>, <string-name><surname>Jordan</surname> <given-names>MI</given-names></string-name> (<year>2023</year>). Principled reinforcement learning with human feedback from pairwise or <italic>k</italic>-wise comparisons. arXiv preprint: <uri>https://arxiv.org/abs/2301.11270</uri>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_034">
<mixed-citation publication-type="journal"> <string-name><surname>Zhu</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>X</given-names></string-name> (<year>2024</year>). <article-title>PromptBench: A unified library for evaluation of large language models</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>25</volume>(<issue>254</issue>): <fpage>1</fpage>–<lpage>22</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1201_ref_035">
<mixed-citation publication-type="other"> <string-name><surname>Ziegler</surname> <given-names>DM</given-names></string-name>, <string-name><surname>Stiennon</surname> <given-names>N</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Brown</surname> <given-names>TB</given-names></string-name>, <string-name><surname>Radford</surname> <given-names>A</given-names></string-name>, <string-name><surname>Amodei</surname> <given-names>D</given-names></string-name>, et al. (<year>2019</year>). Fine-tuning language models from human preferences. arXiv preprint: <uri>https://arxiv.org/abs/1909.08593</uri>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
