OISD: On-Policy Internal Self-Distillation of Language Models

arXiv cs.CV, 2026-05-29. Authors: Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

Abstract

arXiv:2605.29089v1 (cross-listed). Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO), the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both…

The abstract may be incomplete; see the original paper.
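
To make the two alignment mechanisms concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss functions, so the use of KL divergence for both terms, the weights `logit_weight` and `attn_weight`, and all function names are hypothetical. How the intermediate layer's logits are obtained is also not stated, so they are simply taken as inputs here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def logit_alignment_loss(final_logits, intermediate_logits):
    """Assumed form of 'how to think' alignment: KL from teacher to student.

    The final layer is the teacher. In an autograd framework its
    distribution would be detached (treated as a constant) so that
    gradients flow only into the intermediate layer.
    """
    teacher = softmax(final_logits)
    student = softmax(intermediate_logits)
    return kl_divergence(teacher, student)

def attention_alignment_loss(final_attn, intermediate_attn):
    """Assumed form of 'where to look' alignment.

    Each argument is a list of rows, one per query position; every row
    is an attention distribution over key positions. The loss is the
    mean per-query KL divergence from the final layer's attention.
    """
    per_query = [kl_divergence(t, s)
                 for t, s in zip(final_attn, intermediate_attn)]
    return sum(per_query) / len(per_query)

def internal_distillation_loss(final_logits, intermediate_logits,
                               final_attn, intermediate_attn,
                               logit_weight=1.0, attn_weight=1.0):
    """Hypothetical weighted sum of the two alignment terms."""
    return (logit_weight * logit_alignment_loss(final_logits, intermediate_logits)
            + attn_weight * attention_alignment_loss(final_attn, intermediate_attn))
```

In this sketch, an intermediate layer that already matches the final layer incurs zero loss, and any mismatch in predictions or attention adds a positive penalty. In a full training loop such a term would presumably be added to the GRPO objective, but the abstract is truncated before it says how the pieces are combined.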
