Supplementary Material

JDS

Journal of Data Science

ISSN 1683-8602, 1680-743X

School of Statistics, Renmin University of China

JDS1201

10.6339/25-JDS1201

Statistical Data Science

Reward Collapse in Aligning Large Language Models

Ziang Song 1, Tianle Cai 2, Jason D. Lee 2, Weijie J. Su 3∗

1 Stanford University, USA

2 Princeton University, USA

3 University of Pennsylvania, USA

∗Corresponding author. Email: suw@wharton.upenn.edu.

Volume 24, Issue 1 (2026), pp. 146–166

Published online: 21 October 2025

Supplementary Material

The supplementary material contains detailed proofs for all theoretical results in the paper, more details of the experimental setup, and the accompanying source code of our experiments.

Received: 26 November 2024; Accepted: 5 October 2025

© 2026 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences represented as rankings of responses to prompts. In this paper, we document the phenomenon of reward collapse, an empirical observation where the prevailing ranking-based approach results in an identical reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like “write a short story about your best friend” should yield a continuous range of rewards for their completions, while specific prompts like “what is the capital city of New Zealand” should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. Then we derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. Based on the reward distributions for different utility functions, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.

Keywords: alignment; human feedback; optimization; reward model
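The mechanism summarized in the abstract can be illustrated with a small numerical sketch. This is not the authors' code. It assumes, for illustration only, that the ranking-based objective for a single prompt with n ranked completions takes the pairwise form sum over i < j of U(r_i − r_j), with U the log-sigmoid and rewards confined to [0, 1]. The names `ranking_objective` and `fit_rewards` are hypothetical. Nothing in this objective refers to the prompt, so every prompt with the same number of ranked completions is driven to the same reward vector.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -np.logaddexp(0.0, -x)


def ranking_objective(r, utility=log_sigmoid):
    # Assumed pairwise ranking objective: sum over i < j of U(r_i - r_j),
    # where index 0 is the most preferred completion.
    i, j = np.triu_indices(len(r), k=1)
    return utility(r[i] - r[j]).sum()


def fit_rewards(n, init_seed, steps=2000, lr=0.05):
    # Projected gradient ascent on the objective above, rewards kept in [0, 1].
    # init_seed only changes the starting point; it stands in for "a different
    # prompt", since the objective itself has no prompt argument.
    rng = np.random.default_rng(init_seed)
    r = np.sort(rng.uniform(0.0, 1.0, n))[::-1].copy()
    for _ in range(steps):
        diff = r[:, None] - r[None, :]       # diff[i, j] = r_i - r_j
        slope = 1.0 / (1.0 + np.exp(diff))   # derivative of log-sigmoid
        upper = np.triu(slope, k=1)          # keep only pairs with i < j
        grad = upper.sum(axis=1) - upper.sum(axis=0)
        r = np.clip(r + lr * grad, 0.0, 1.0)
    return r


if __name__ == "__main__":
    rewards_a = fit_rewards(5, init_seed=0)
    rewards_b = fit_rewards(5, init_seed=1)
    print(rewards_a)
    print(rewards_b)
    print("identical up to tolerance:", np.allclose(rewards_a, rewards_b))
```

Different starting points, standing in for different prompts, end at the same rewards under these assumptions. Letting the utility U depend on the prompt is the kind of remedy the abstract calls prompt-aware.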
