Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

ArXiv CS.CL | 2026-06-11 | NEWS | en | Authors: Christian Walder, Deep Karkhanis

Details

Source site
ArXiv CS.CL
Authors
Christian Walder, Deep Karkhanis
Article type
NEWS
Language
en
Publication date
2026-06-11

Abstract

arXiv:2505.15201v5 (announce type: replace-cross)

Reinforcement Learning (RL) algorithms sample multiple (n > 1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. It also under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation of the final rewards that leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings.
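To make the pass@1 versus pass@k distinction concrete for the binary-reward case, the sketch below computes the widely used combinatorial estimate of pass@k: the probability that at least one of k attempts, drawn without replacement from n sampled attempts of which c are correct, succeeds. This is only an illustration of the metric. It is not the paper's estimator, its gradient estimator, or the PKPO reward transformation, and the function name `pass_at_k` and its parameters are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Assumption: binary rewards (each attempt is either correct or not).
    Returns the fraction of size-k subsets of the n attempts that
    contain at least one correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the subsets made up only of incorrect attempts;
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct attempt out of four: k = 1 reduces to the plain success rate.
    print(pass_at_k(n=4, c=1, k=1))
    # The same samples score higher when judged jointly as a set of two.
    print(pass_at_k(n=4, c=1, k=2))
```

With k = 1 the estimate reduces to the per-sample success rate c / n, which is the quantity that independent per-sample rewards target. For k > 1 the value depends on the set of samples as a whole, which is the joint view the abstract describes.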
