OPD+: Rethinking the Advantage Design for On-Policy Distillation (Event)
PRODUCT_LAUNCH | 2026-06-02 | Impact: MEDIUM
arXiv:2606.01039v1 | Announce Type: cross
Abstract: On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student generated rollouts. Yet, despite the divergence reward being dependent on student model likelihood, existing works usually adopt a stop gradient design primar
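As a rough illustration of the setup the abstract describes, the sketch below computes a per-token divergence-style reward on student-sampled tokens and plugs it into a policy-gradient surrogate as a fixed advantage. The reward form (teacher log-probability minus student log-probability), the function names, and the toy logits are all assumptions made for illustration; the truncated abstract does not specify the paper's actual objective, so this is not OPD+ itself.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def opd_token_rewards(student_logits, teacher_logits, sampled_tokens):
    """Assumed reward: log pi_T(y_t) - log pi_S(y_t) on student-sampled tokens.

    Note that the reward depends on the student's own likelihood.
    """
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        rewards.append(log_softmax(t_row)[tok] - log_softmax(s_row)[tok])
    return rewards

def surrogate_loss(student_logits, sampled_tokens, advantages):
    """Policy-gradient surrogate: -mean(A_t * log pi_S(y_t)).

    The advantages arrive as plain floats, i.e. as constants. In an autodiff
    framework this is the point where a stop-gradient would be applied.
    """
    total = 0.0
    for s_row, tok, adv in zip(student_logits, sampled_tokens, advantages):
        total += adv * log_softmax(s_row)[tok]
    return -total / len(sampled_tokens)

# Hypothetical two-token rollout over a three-token vocabulary.
student_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sampled_tokens = [0, 1]

rewards = opd_token_rewards(student_logits, teacher_logits, sampled_tokens)
loss = surrogate_loss(student_logits, sampled_tokens, rewards)
print(rewards, loss)
```

In this toy rollout the first token gets a negative reward because the student is more confident in it than the teacher, and the second token gets a reward of zero because the two distributions agree there.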
Related coverage (1): OPD+: Rethinking the Advantage Design for On-Policy Distillation, ArXiv CS.AI, 2026-06-02