OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning 事件

PRODUCT_LAUNCH2026-06-02影响: MEDIUM

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning arXiv:2605.12400v2 Announce Type: replace-cross Abstract: We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may in

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning · 相关技术