Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

ArXiv CS.CL | 2026-05-29 | Authors: Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

Abstract

arXiv:2602.01058v2 (announce type: replace-cross)

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels.
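The abstract names the mechanism (importance sampling applied to the SFT loss at token, block, or sequence granularity) but does not give the weighting formula. The sketch below is therefore only a generic illustration of what reweighting at those three granularities could look like, not PEAR's actual method. It assumes per-token log-probabilities are available from both the current policy and the data-generating (behavior) distribution. The names `importance_weights` and `reweighted_nll`, the `block_size` parameter, and the clipping constant are all hypothetical.

```python
import math

def importance_weights(logp_policy, logp_behavior, level="token",
                       block_size=2, clip=10.0):
    """Per-token importance weights at a chosen granularity.

    Assumption (not from the paper): the weight of a span is the clipped
    ratio of policy to behavior probability over that span, shared by
    every token inside it.
    """
    if len(logp_policy) != len(logp_behavior):
        raise ValueError("log-prob sequences must have equal length")
    log_ratios = [p - b for p, b in zip(logp_policy, logp_behavior)]
    n = len(log_ratios)
    if level == "token":
        span = 1
    elif level == "block":
        span = block_size
    elif level == "sequence":
        span = max(n, 1)
    else:
        raise ValueError(f"unknown level: {level}")
    weights = []
    for start in range(0, n, span):
        chunk = log_ratios[start:start + span]
        # Product of ratios over the span, computed in log space, then clipped.
        w = min(math.exp(sum(chunk)), clip)
        weights.extend([w] * len(chunk))
    return weights

def reweighted_nll(logp_policy, weights):
    """Mean negative log-likelihood with each token scaled by its weight.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); here everything is a plain float.
    """
    total = sum(-w * lp for w, lp in zip(weights, logp_policy))
    return total / len(logp_policy)

# Toy example with made-up log-probabilities for a four-token response.
policy = [-0.1, -0.2, -0.3, -0.4]
behavior = [-0.2, -0.2, -0.5, -0.3]
for level in ("token", "block", "sequence"):
    w = importance_weights(policy, behavior, level=level)
    print(level, [round(x, 3) for x in w], round(reweighted_nll(policy, w), 4))
```

Coarser spans share one weight across more tokens. In this sketch a sequence-level weight is a product over all token ratios, which is why it is clipped: such products tend to have higher variance than per-token ratios.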
