Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

ArXiv CS.CL | 2026-06-02 | Authors: Yuxuan Jiang, Francis Ferraro

Abstract

arXiv:2606.00305v1. On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift.
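To make the gap between a "high-loss" token and a "low-divergence" one concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes, purely for illustration, that the per-token loss is the log-ratio log p_student(y) / p_teacher(y) at the sampled token y, and that divergence is the full reverse KL over the next-token distribution. The abstract does not state either definition. The two-token vocabulary, the probabilities, and the function names are all hypothetical.

```python
import math


def reverse_kl(student, teacher):
    """Full reverse KL, KL(student || teacher), over one next-token distribution.

    Assumes both inputs are probability lists of equal length and that the
    teacher assigns nonzero probability wherever the student does.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def sampled_log_ratio(student, teacher, token):
    """Single-sample signal at the token the student actually emitted.

    Assumption: this stands in for the per-token "loss"; the source does not
    define it.
    """
    return math.log(student[token] / teacher[token])


# Hypothetical toy distributions over a two-token vocabulary.
# Both models strongly prefer token 0; they differ only on the rare token 1.
student = [0.98, 0.02]
teacher = [0.9995, 0.0005]

# Suppose the student happened to sample the rare token.
sampled_token = 1

loss = sampled_log_ratio(student, teacher, sampled_token)
divergence = reverse_kl(student, teacher)

print(f"sampled-token log-ratio: {loss:.3f}")
print(f"full reverse KL:         {divergence:.3f}")
```

In this toy case the sampled-token signal is large while the two distributions are nearly identical. Under the stated assumptions, that is one way a token can look high-loss without marking a real fork in the reasoning.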
