Event: Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

Name: Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence
Start: 2026-05-29

PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM

arXiv:2605.13230v2 (announce type: replace-cross)

Abstract: On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy.
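
To make the reverse KL term concrete, here is a minimal sketch of how a per-token reverse KL, KL(student || teacher), could be computed over a trajectory sampled from the student. It is not the paper's implementation: the function names, the plain-list logit representation, and the mean-over-tokens reduction are all assumptions chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    The expectation is taken under the student distribution, which is
    what makes this the "reverse" direction in distillation.
    """
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t))


def sequence_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student-sampled trajectory.

    Each argument is a list of per-position logit vectors (hypothetical
    layout); averaging over positions is an assumed reduction.
    """
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)
```

In this sketch the value is zero when student and teacher agree at every position and grows as the student places probability mass on tokens the teacher considers unlikely.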

Artificial Intelligence
