SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Source: arXiv cs.CL. Date: 2026-06-01. Authors: Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

Abstract

arXiv:2605.31433v1

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.
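The loop described in the abstract (Challenger proposes a task from a document, the judge writes a rubric from that document, the Solver answers, the judge grades) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Episode`, `grade`, `self_play_round`, and the `toy_*` functions) is hypothetical, the policies are hard-coded stand-ins rather than language models, the Solver's multi-turn retrieval is omitted, and the keyword-matching grader is only an assumed placeholder for rubric-based grading. The abstract does not specify how rewards are computed, so the sketch stops at a single rubric score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    """One self-play round: the task, its rubric, the answer, and the score."""
    task: str
    rubric: List[str]
    answer: str
    score: float


def grade(rubric: List[str], answer: str) -> float:
    """Toy judge (assumption): fraction of rubric items mentioned in the answer."""
    if not rubric:
        return 0.0
    text = answer.lower()
    hits = sum(1 for item in rubric if item.lower() in text)
    return hits / len(rubric)


def self_play_round(
    document: str,
    challenger: Callable[[str], str],
    solver: Callable[[str], str],
    write_rubric: Callable[[str, str], List[str]],
) -> Episode:
    """Run one round: generate a task, write a rubric, answer, then grade."""
    task = challenger(document)            # Challenger: document-grounded task
    rubric = write_rubric(document, task)  # judge: rubric from the source document
    answer = solver(task)                  # Solver: answers (retrieval omitted here)
    return Episode(task, rubric, answer, grade(rubric, answer))


# Hard-coded stand-ins for the three roles (hypothetical, for illustration only).
def toy_challenger(document: str) -> str:
    return f"Summarize the key points of: {document[:40]}"


def toy_rubric(document: str, task: str) -> List[str]:
    return ["retrieval", "rubric"]


def toy_solver(task: str) -> str:
    return "The method uses retrieval to gather evidence."


doc = "SCOPE pairs retrieval with rubric-based grading."
episode = self_play_round(doc, toy_challenger, toy_solver, toy_rubric)
print(episode.score)  # 0.5: the answer mentions "retrieval" but not "rubric"
```

In a real system the three callables would wrap model calls, with the rubric writer and grader backed by the frozen initial model and the Challenger and Solver updated from the resulting scores.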
