Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

Event: PRODUCT_LAUNCH | Date: 2026-06-02 | Impact: MEDIUM

arXiv:2606.00305v1. Announce type: new.

Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that thi