OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Event: PRODUCT_LAUNCH
Date: 2026-06-02
Impact: MEDIUM

arXiv:2606.01476v1
Announce Type: cross

Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access