Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

arXiv cs.CL | 2026-06-02 | Authors: Yilong Li, Suman Banerjee, Tong Che

Abstract

arXiv:2605.27000v2 (announce type: replace)

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method.
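The contrast between independent sampling and coordinated attempts can be illustrated with a small toy calculation. The sketch below is not the paper's training objective or implementation; it assumes a hypothetical problem with four candidate strategies, a made-up distribution over which strategy an independent sample picks (`strategy_probs`), and a made-up per-strategy chance that a solver attempt is correct (`success_probs`). All function names and numbers are illustrative assumptions.

```python
from math import prod


def pass_at_k_independent(strategy_probs, success_probs, k):
    """pass@k when each of k attempts independently samples its own strategy.

    Toy assumption: an attempt picks strategy i with probability
    strategy_probs[i] and is then correct with probability success_probs[i].
    """
    per_attempt = sum(p * s for p, s in zip(strategy_probs, success_probs))
    return 1.0 - (1.0 - per_attempt) ** k


def pass_at_k_coordinated(success_probs, chosen):
    """pass@k when one attempt is made per distinct strategy in `chosen`.

    Toy assumption: attempts on different strategies succeed independently.
    """
    return 1.0 - prod(1.0 - success_probs[i] for i in chosen)


# Hypothetical numbers: the policy strongly favors strategy 0,
# which happens to fail on this problem; the rarer strategies can work.
strategy_probs = [0.90, 0.05, 0.03, 0.02]
success_probs = [0.0, 0.5, 0.5, 0.5]
K = 4

independent = pass_at_k_independent(strategy_probs, success_probs, K)
coordinated = pass_at_k_coordinated(success_probs, chosen=range(K))

print(f"independent pass@{K}: {independent:.3f}")
print(f"coordinated pass@{K}: {coordinated:.3f}")
```

Under these assumed numbers, the independent sampler spends most of its four attempts on the favored but failing strategy, while covering one strategy per attempt yields a much higher chance that at least one attempt is correct. The gap depends entirely on the chosen toy values and says nothing about the results reported for CPPO.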