Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence
Event: PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM
arXiv:2605.13230v2 Announce Type: replace-cross
Abstract: On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy.
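To make the reverse-KL supervision signal mentioned in the abstract concrete, here is a minimal sketch of the generic quantity KL(student || teacher), computed per token and averaged over a trajectory that is assumed to have been sampled from the student. It illustrates the standard definition only and is not the paper's method; the function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one next-token distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) over the vocabulary at one position.

    The expectation is taken under the student distribution, which is
    what makes this the "reverse" direction in distillation.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def trajectory_reverse_kl(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL along one trajectory.

    Assumption: both logit sequences were obtained by scoring the same
    student-sampled tokens, one list of vocabulary logits per position.
    """
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


# Toy two-token vocabulary, two positions (hypothetical values).
student_seq = [[0.0, 0.0], [1.0, 2.0]]
teacher_seq = [[2.0, 0.0], [1.0, 2.0]]
loss = trajectory_reverse_kl(student_seq, teacher_seq)
```

In this toy case the second position contributes zero because student and teacher agree there, so the loss comes entirely from the first position. Swapping the arguments of `reverse_kl` gives the forward KL, which generally takes a different value.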