Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
Event: PRODUCT_LAUNCH · 2026-06-04 · Impact: MEDIUM
arXiv:2601.15158v4
Announce Type: replace-cross
Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy grad
Related coverage
Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
ArXiv CS.AI · 2026-06-04