A Predictive Law for On-Policy Self-Distillation From World Feedback

arXiv cs.AI | 2026-05-29 | Authors: Tommy He, Jerome Sieber, Matteo Saponati

Abstract

arXiv:2605.30070v1 (announce type: cross)

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that can use arbitrary feedback as a learning signal, yet its reliability compared to established methods such as GRPO remains unclear. We identify a strikingly consistent linear correlation between the initial performance gap between the student and its self-teacher and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds across model scales, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities.
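To make the idea of a linear predictive law concrete, the following is a minimal sketch of how such a fit could be used. It is not the authors' code. It assumes the gap is measured before training as self-teacher score minus student score, and that improvement is final student score minus initial student score; the abstract does not give these definitions or any coefficients. The function names `fit_line` and `predict_gain` are hypothetical, and the numbers in the usage section are made-up placeholders, not results from the paper.

```python
from typing import Sequence, Tuple


def fit_line(gaps: Sequence[float], gains: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of gains ~ slope * gaps + intercept."""
    n = len(gaps)
    if n < 2 or n != len(gains):
        raise ValueError("need at least two paired observations")
    mean_x = sum(gaps) / n
    mean_y = sum(gains) / n
    # Sum of squared deviations of the gaps; zero means all gaps are identical.
    sxx = sum((x - mean_x) ** 2 for x in gaps)
    if sxx == 0:
        raise ValueError("gaps must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gaps, gains))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_gain(gap: float, slope: float, intercept: float) -> float:
    """Predicted improvement for a new configuration with the given initial gap."""
    return slope * gap + intercept


if __name__ == "__main__":
    # Placeholder values for illustration only (NOT taken from the paper):
    # initial self-teacher-minus-student gaps and observed final improvements.
    observed_gaps = [0.02, 0.05, 0.10, 0.20]
    observed_gains = [0.01, 0.03, 0.06, 0.11]
    slope, intercept = fit_line(observed_gaps, observed_gains)
    # Estimate the improvement of an untrained configuration from its gap alone.
    print(predict_gain(0.15, slope, intercept))
```

Under these assumptions, the gap for a new configuration needs only evaluation runs of the student with and without the feedback in context, so the prediction would be available before any training is performed.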
