Trust Region On-Policy Distillation Event

PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM

Trust Region On-Policy Distillation · arXiv:2606.01249v1 · Announce Type: cross

Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even caus
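To make "teacher supervision on student-generated tokens" concrete, here is a minimal sketch of a plain on-policy distillation loss. It assumes a per-token reverse KL divergence, KL(student || teacher), which is one common choice and is not confirmed by the abstract. It does not attempt the paper's trust-region method, which the truncated abstract does not describe. The function names (`log_softmax`, `reverse_kl`, `opd_loss`) and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (assumed objective)."""
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    return sum(
        math.exp(ls) * (ls - lt)
        for ls, lt in zip(log_p_student, log_p_teacher)
    )


def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL over a sequence sampled from the student.

    Both arguments hold one list of logits per position of the same
    student-generated sequence; the teacher only scores those positions.
    """
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


# Toy logits over a 3-token vocabulary at two positions (illustrative values).
student_seq = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
close_teacher = [[1.9, 0.6, 0.1], [0.3, 1.4, 0.3]]
far_teacher = [[0.1, 0.5, 4.0], [3.0, 0.1, 0.2]]

loss_close = opd_loss(student_seq, close_teacher)
loss_far = opd_loss(student_seq, far_teacher)
```

In this toy setup `loss_far` is larger than `loss_close`, which illustrates the regime the abstract flags: the further the teacher's distribution sits from the student's on the student's own samples, the larger the training signal becomes.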

Trust Region On-Policy Distillation · Related People

No data available