OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

ArXiv cs.AI, 2026-06-02. Authors: Yuxiao Yang, Xiaoyun Wang, Weitong Zhang

Abstract

arXiv:2605.12400v2

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance.
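The abstract does not give the steering formula, so the following is only a toy sketch of one way a contrast between success-conditioned and failure-conditioned teacher logits could produce a steered distillation target. Everything in it is an assumption made for illustration: the function names, the additive rule `pos + alpha * (pos - neg)`, the strength parameter `alpha`, the choice of the success-conditioned logits as the base, and the forward KL loss. It should not be read as the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_logits(teacher_pos, teacher_neg, alpha=1.0):
    """Hypothetical steering rule (an assumption, not taken from the paper).

    teacher_pos: teacher logits conditioned on a successful trajectory
    teacher_neg: teacher logits conditioned on a failed trajectory
    The difference (pos - neg) serves as an outcome-discriminative
    direction; alpha scales how far the logits move along it.
    """
    return [p + alpha * (p - n) for p, n in zip(teacher_pos, teacher_neg)]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token logits over a 3-token vocabulary at a single position.
teacher_pos = [2.0, 0.5, -1.0]
teacher_neg = [1.0, 1.5, -1.0]
student = [0.0, 0.0, 0.0]

steered = steer_logits(teacher_pos, teacher_neg, alpha=0.5)
target = softmax(steered)

# Token-level distillation loss of the student against the steered target.
loss = kl_divergence(target, softmax(student))
print(steered, loss)
```

In this toy case the steered target puts more mass on the first token than the unsteered success-conditioned teacher does, because that token is the one favored more strongly under the successful trajectory than under the failed one.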
