Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Event: PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM

arXiv:2602.01058v2
Announce Type: replace-cross

Abstract: Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform th