SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

ArXiv CS.CL | 2026-06-18 | NEWS | en | Authors: Siddharth Aphale, Kelly Liu

Details

Source site
ArXiv CS.CL
Authors
Siddharth Aphale, Kelly Liu
Article type
NEWS
Language
en
Publication date
2026-06-18

Abstract

arXiv:2606.18487v1 Announce Type: cross Abstract: The standard heuristic of starting GRPO from the SFT checkpoint with the highest pass@1 can fail when SFT compresses the rollout distribution. For binary rewards with per-rollout success probability $p$ and group size $g$, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below a threshold $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for two models: Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3-seed mean, $n{=}20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert.
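
The variance formula and the threshold quoted in the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper. All function names are hypothetical. The abstract does not define $p^*(g)$; the code assumes it is the pass rate below which more than half of all groups have identical rewards, a reading that reproduces the quoted $p^*(8){=}0.083$. The simulation uses the population (divide by $g$) variance of the binary rewards within a group, which is the convention that matches $p(1{-}p)(g{-}1)/g$.

```python
import random


def expected_within_group_variance(p: float, g: int) -> float:
    """Closed form quoted in the abstract: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def degenerate_group_probability(p: float, g: int) -> float:
    """Probability that all g binary rewards in a group are identical
    (all successes or all failures), given i.i.d. Bernoulli(p) rewards."""
    return p ** g + (1.0 - p) ** g


def threshold_pass_rate(g: int, level: float = 0.5) -> float:
    """ASSUMPTION (not stated in the abstract): p*(g) is the pass rate in
    (0, 0.5) below which more than `level` of groups are degenerate.
    Solved by bisection; intended for g >= 2."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if degenerate_group_probability(mid, g) > level:
            lo = mid  # too many degenerate groups: threshold is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulated_within_group_variance(
    p: float, g: int, n_groups: int = 20000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the mean population variance of rewards
    within a group of size g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_groups):
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(g)]
        mean = sum(rewards) / g
        total += sum((r - mean) ** 2 for r in rewards) / g
    return total / n_groups


if __name__ == "__main__":
    print(round(threshold_pass_rate(8), 3))  # 0.083
    print(expected_within_group_variance(0.3, 8))
    print(simulated_within_group_variance(0.3, 8))
```

Under this assumed definition, a policy sitting exactly at $p^*(8)$ would have about half of its 8-rollout groups contribute zero group-relative advantage, which is one way to read the abstract's statement that most groups carry no signal once $p$ drops below the threshold.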
