RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

ArXiv CS.AI · 2026-06-02 · Authors: Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

Abstract

arXiv:2606.01281v1 (announce type: cross)

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints.
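To make the zero-variance problem concrete, the following is a minimal sketch of why an all-correct or all-incorrect response group contributes no learning signal. It assumes a group-normalized advantage of the form (reward minus group mean) divided by group standard deviation, as used in GRPO-style training; the abstract does not state which estimator the authors use, and the names `group_advantages`, `is_ineffective`, and `eps` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Assumption: a GRPO-style estimator; not confirmed by the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_ineffective(rewards):
    """A group is ineffective when every response received the same reward."""
    return len(set(rewards)) <= 1


# Binary verifiable rewards for three hypothetical prompts, four responses each.
groups = {
    "all_correct": [1.0, 1.0, 1.0, 1.0],
    "all_incorrect": [0.0, 0.0, 0.0, 0.0],
    "mixed": [1.0, 0.0, 0.0, 1.0],
}

for name, rewards in groups.items():
    print(name, is_ineffective(rewards), group_advantages(rewards))
```

Under this assumption, both uniform groups produce advantages that are all zero, so the rollouts spent on them yield no gradient, while only the mixed group yields nonzero advantages.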