Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

ArXiv CS.CL | 2026-06-11 | NEWS | en | Authors: Christian Walder, Deep Karkhanis

Details

Source site: ArXiv CS.CL
Authors: Christian Walder, Deep Karkhanis
Article type: NEWS
Language: en
Publication date: 2026-06-11

Abstract

arXiv:2505.15201v5 (announce type: replace-cross). Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. As a result, the sampling capacity is under-utilized, which limits exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation of the final rewards that leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings.
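
To make the pass@1 versus pass@k distinction concrete, here is a minimal sketch for the binary-reward case. It uses the standard combinatorial estimator 1 - C(n-c, k)/C(n, k), where c of the n sampled attempts are correct. This is an illustration only: it is not taken from the paper and does not reproduce PKPO's reward transformation or its gradient estimator. The function names `pass_at_k` and `pass_at_k_bruteforce` are hypothetical.

```python
from itertools import combinations
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n binary-reward attempts, c of which are correct.

    Equals the fraction of size-k subsets of the n attempts that contain
    at least one correct attempt: 1 - C(n-c, k) / C(n, k).
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_bruteforce(rewards: list[int], k: int) -> float:
    """Same quantity by enumeration: mean over all k-subsets of the best reward."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    rewards = [1, 0, 0, 0]  # assumed toy data: one correct attempt out of four
    n, c = len(rewards), sum(rewards)
    for k in (1, 2, 3, 4):
        print(k, pass_at_k(n, c, k), pass_at_k_bruteforce(rewards, k))
```

With one correct attempt out of four, the pass@1 estimate is 0.25 while the pass@2 estimate is 0.5. This illustrates why an objective that scores sets of samples jointly can favor different behavior than one that rewards each sample in isolation.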
