Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

ArXiv CS.AI | 2026-06-01 | Authors: Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

Abstract

arXiv:2605.30619v1 (cross-listed). Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley-Terry (BT) reward learning extracts from such data, and how to choose $N$ and the base distribution, remain unclear. We specialize a recent analysis of preference data via its induced conditional distribution to Best-of-$N$. For independent-reference variants, we derive closed-form reward targets as explicit functions of $N$ and the base distribution, and show that they preserve the latent reward ranking. For the practical Best-vs-Random and Best-vs-Worst variants, chosen and rejected responses are coupled through the same candidate set, so exact BT representability generally fails; nevertheless, bounded-class minimizers approach the reference targets as $N$ grows.
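To make the data-construction step concrete, the following is a minimal sketch of how a Best-vs-Worst or Best-vs-Random pair might be built from one candidate set, together with the standard Bradley-Terry pairwise loss. The function names (`best_of_n_pair`, `bt_loss`), the `variant` strings, and the choice to draw the Best-vs-Random rejected response uniformly from the non-best candidates are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math
import random


def best_of_n_pair(candidates, rewards, variant="best_vs_random", rng=random):
    """Build one (chosen, rejected) pair from N candidates for the same prompt.

    `rewards[i]` is the latent/annotator score of `candidates[i]`.
    Both responses come from the same candidate set, which is the coupling
    the abstract refers to for the practical variants.
    """
    if len(candidates) != len(rewards) or len(candidates) < 2:
        raise ValueError("need at least two candidates with matching rewards")

    # Indices sorted from lowest to highest reward.
    order = sorted(range(len(candidates)), key=lambda i: rewards[i])
    best = order[-1]

    if variant == "best_vs_worst":
        rejected = order[0]
    elif variant == "best_vs_random":
        # Assumption: rejected is uniform over the N-1 non-best candidates.
        rejected = rng.choice(order[:-1])
    else:
        raise ValueError(f"unknown variant: {variant}")

    return candidates[best], candidates[rejected]


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood, -log sigmoid(r_chosen - r_rejected).

    Written in a numerically stable form for large reward gaps.
    """
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


if __name__ == "__main__":
    cands = ["a", "b", "c", "d"]
    scores = [0.1, 0.9, 0.5, 0.3]
    print(best_of_n_pair(cands, scores, "best_vs_worst"))
    print(best_of_n_pair(cands, scores, "best_vs_random", random.Random(0)))
    print(bt_loss(0.9, 0.1))
```

In this sketch, increasing the number of candidates changes the distribution of the chosen response (and, for Best-vs-Worst, of the rejected one), which is why the abstract treats $N$ and the base distribution as design choices that shape the learned reward target.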
