Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data 文章

ArXiv CS.AI2026-06-04NEWSen作者: Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

摘要

arXiv:2601.15158v4 Announce Type: replace-cross Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps.

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (5)