Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Event: PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM
arXiv:2602.01058v2 Announce Type: replace-cross
Abstract: Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform th
Source: ArXiv CS.CL, 2026-05-29