Reward Collapse in Aligning Large Language Models
Pub. online: 21 October 2025
Type: Statistical Data Science
Open Access
Received: 26 November 2024
Accepted: 5 October 2025
Published: 21 October 2025
Abstract
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation that the prevailing ranking-based approach results in an identical reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable: open-ended prompts like “write a short story about your best friend” should yield a continuous range of rewards for their completions, while specific prompts like “what is the capital city of New Zealand” should generate either high or low rewards. Our theoretical investigation reveals that reward collapse arises primarily because the ranking-based objective function fails to incorporate prompt-related information during optimization. We then derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. Based on the reward distributions for different utility functions, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that the proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
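For concreteness, the sketch below illustrates the standard ranking-based objective used to train reward models from preference rankings, here with the common log-sigmoid (Bradley–Terry-style) utility; it does not reproduce the specific utility functions or the prompt-aware scheme studied in the paper. The point to notice is that the loss depends only on the rewards of the ranked completions and contains no prompt-dependent term, which is the structural feature identified above as the source of reward collapse. The function name and the toy reward values are illustrative assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def ranking_reward_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over the rewards of n completions of one prompt.

    `rewards` holds scores in preference order (rewards[0] belongs to the most
    preferred completion). For every ordered pair (i, j) with i preferred to j,
    the loss term is -log(sigmoid(r_i - r_j)), the log-sigmoid utility commonly
    used when training RLHF reward models. The prompt itself never enters the
    objective, so the optimum depends only on the ranking.
    """
    n = rewards.shape[0]
    # Pairwise differences r_i - r_j for all pairs (i, j).
    diffs = rewards.unsqueeze(1) - rewards.unsqueeze(0)          # shape (n, n)
    # Keep only pairs with i < j, i.e. i strictly preferred to j.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    return -F.logsigmoid(diffs[mask]).mean()

# Toy usage: rewards for four ranked completions of a single prompt.
rewards = torch.tensor([2.1, 1.3, 0.4, -0.7], requires_grad=True)
loss = ranking_reward_loss(rewards)
loss.backward()

In this formulation, swapping a different utility in place of the log-sigmoid changes the limiting reward distribution, and a prompt-aware variant would additionally let the utility depend on the prompt; both extensions are beyond this minimal sketch.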
Supplementary material
The supplementary material contains detailed proofs for all theoretical results in the paper, further details of the experimental setup, and the accompanying source code of our experiments.