OPD+: Rethinking the Advantage Design for On-Policy Distillation 文章

ArXiv CS.AI2026-06-02NEWSen作者: Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

摘要

arXiv:2606.01039v1 Announce Type: cross Abstract: On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student generated rollouts. Yet, despite the divergence reward being dependent on student model likelihood, existing works usually adopt a stop gradient design primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on f-divergence between the student and teacher, and mathematically revisit whether such design space is valid. We prove that general stop-gradient operation would lead to biased estimates of the reward objective and corresponding gradient for general divergence functions.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据