Reward Collapse in Aligning Large Language Models
Pub. online: 21 October 2025
Type: Statistical Data Science
Open Access
Received: 26 November 2024
Accepted: 5 October 2025
Published: 21 October 2025
Abstract
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation that the prevailing ranking-based approach results in an identical reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable: open-ended prompts like “write a short story about your best friend” should yield a continuous range of rewards for their completions, while specific prompts like “what is the capital city of New Zealand” should generate either high or low rewards. Our theoretical investigation reveals that reward collapse arises primarily because the ranking-based objective function fails to incorporate prompt-related information during optimization. We then derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. Based on the reward distributions for different utility functions, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that the proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
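For concreteness, the sketch below illustrates the standard ranking-based objective used to train reward models from preference rankings, here with the common log-sigmoid (Bradley–Terry-style) utility; it does not reproduce the specific utility functions or the prompt-aware scheme studied in the paper. The point to notice is that the loss depends only on the rewards of the ranked completions and contains no prompt-dependent term, which is the structural feature identified above as the source of reward collapse. The function name and the toy reward values are illustrative assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def ranking_reward_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over the rewards of n completions of one prompt.

    `rewards` holds scores in preference order (rewards[0] belongs to the most
    preferred completion). For every ordered pair (i, j) with i preferred to j,
    the loss term is -log(sigmoid(r_i - r_j)), the log-sigmoid utility commonly
    used when training RLHF reward models. The prompt itself never enters the
    objective, so the optimum depends only on the ranking.
    """
    n = rewards.shape[0]
    # Pairwise differences r_i - r_j for all pairs (i, j).
    diffs = rewards.unsqueeze(1) - rewards.unsqueeze(0)          # shape (n, n)
    # Keep only pairs with i < j, i.e. i strictly preferred to j.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    return -F.logsigmoid(diffs[mask]).mean()

# Toy usage: rewards for four ranked completions of a single prompt.
rewards = torch.tensor([2.1, 1.3, 0.4, -0.7], requires_grad=True)
loss = ranking_reward_loss(rewards)
loss.backward()

In this formulation, swapping a different utility in place of the log-sigmoid changes the limiting reward distribution, and a prompt-aware variant would additionally let the utility depend on the prompt; both extensions are beyond this minimal sketch.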
Supplementary material
The supplementary material contains detailed proofs for all theoretical results in the paper, further details of the experimental setup, and the accompanying source code of our experiments.