OISD: On-Policy Internal Self-Distillation of Language Models

ArXiv CS.CV, 2026-05-29, NEWS, en
Authors: Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

Abstract

arXiv:2605.29089v1 (announce type: cross)

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both…

The abstract may be incomplete; see the original article for the full text.
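
The two alignment terms named in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not state the loss forms, so everything below is an assumption: the logit term is written as a KL divergence from the final-layer (teacher) distribution to the intermediate-layer (student) distribution, the attention term as a mean squared error between attention maps, and the function names and the weights `lambda_logit` and `lambda_attn` are hypothetical. The teacher values are plain numbers here, which plays the role of the "detached" teacher; how these terms combine with the GRPO objective is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def logit_alignment_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary (assumed loss form).

    teacher_logits: final-layer logits, treated as constants (detached).
    student_logits: logits read out from a selected intermediate layer.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def attention_alignment_loss(teacher_attn, student_attn):
    """Mean squared error between two attention maps (assumed loss form)."""
    diffs = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher_attn, student_attn)
        for t, s in zip(t_row, s_row)
    ]
    return sum(diffs) / len(diffs)


def internal_distillation_loss(teacher_logits, student_logits,
                               teacher_attn, student_attn,
                               lambda_logit=1.0, lambda_attn=1.0):
    """Weighted sum of the two alignment terms (weights are hypothetical)."""
    return (lambda_logit * logit_alignment_loss(teacher_logits, student_logits)
            + lambda_attn * attention_alignment_loss(teacher_attn, student_attn))


if __name__ == "__main__":
    teacher_logits = [2.0, 0.5, -1.0]
    student_logits = [1.0, 1.0, 0.0]
    teacher_attn = [[1.0, 0.0], [0.5, 0.5]]
    student_attn = [[0.5, 0.5], [0.5, 0.5]]
    print(internal_distillation_loss(teacher_logits, student_logits,
                                     teacher_attn, student_attn))
```

In an autograd framework the teacher inputs would be detached from the computation graph so that gradients from these terms reach only the intermediate layer; both terms are zero when the intermediate layer already matches the final layer.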
