Trust Region On-Policy Distillation Event
PRODUCT_LAUNCH 2026-06-02 Impact: MEDIUM
Trust Region On-Policy Distillation
arXiv:2606.01249v1 Announce Type: cross
Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even caus