Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

PRODUCT_LAUNCH · 2026-06-04 · Impact: MEDIUM

arXiv:2601.15158v4 · Announce Type: replace-cross

Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy grad
