The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

arXiv cs.AI | 2026-05-26 | Authors: Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu

Abstract

arXiv:2605.11182v2. On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings because the instance-specific privileged information (PI) it relies on is absent at test time. In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference.
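To make "dense token-level supervision" and "loss formulation" concrete, here is a minimal sketch of one common way a distillation loss can be computed on a trajectory sampled from the student. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the loss, so the per-token KL divergence, the `reverse` switch, and the names `token_kl` and `opd_loss` are all hypothetical. The logits are toy values standing in for student and teacher outputs at each position of a student-sampled sequence.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(p_logits, q_logits):
    """KL(p || q) between two categorical distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def opd_loss(student_logits, teacher_logits, reverse=True):
    """Mean per-token KL over one trajectory sampled from the student.

    Each argument is a list with one logit vector per token position.
    reverse=True  gives KL(student || teacher), often called reverse KL.
    reverse=False gives KL(teacher || student), often called forward KL.
    Both variants are assumptions for illustration only.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same positions")
    if not student_logits:
        raise ValueError("trajectory must contain at least one token")
    kls = [
        token_kl(s, t) if reverse else token_kl(t, s)
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(kls) / len(kls)


# Toy trajectory: two positions, vocabulary of three tokens.
student = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print("reverse KL:", opd_loss(student, teacher, reverse=True))
print("forward KL:", opd_loss(student, teacher, reverse=False))
```

Every position contributes its own loss term, which is the sense in which the supervision is dense. The two KL directions give different values on the same trajectory, which is one simple way the choice of loss formulation can matter. In a self-distillation (OPSD) variant, the teacher logits would presumably come from the same model conditioned on privileged information, though that wiring is not shown here.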