Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

ArXiv CS.CL, 2026-05-28
Authors: Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

Abstract

arXiv:2605.11458v2 (announce type: replace-cross)

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: (1) full exposure is not reliably the best choice, and (2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable.
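To make "teacher exposure" concrete, here is a minimal Python sketch of how a fixed-exposure setting might be set up. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how exposure is parameterized or how mismatch is measured. The sketch assumes exposure is a fraction in [0, 1] of reference reasoning steps shown to the teacher as a prefix, and uses KL(teacher || student) over next-token distributions as one possible mismatch measure. All function names (`expose_reference`, `build_teacher_prompt`, `kl_divergence`) are hypothetical.

```python
import math
from typing import List, Sequence


def expose_reference(reasoning_steps: Sequence[str], exposure: float) -> List[str]:
    """Return the prefix of reference reasoning steps the teacher may see.

    Assumption: exposure is a fraction in [0, 1] of the step count.
    """
    if not 0.0 <= exposure <= 1.0:
        raise ValueError("exposure must be in [0, 1]")
    # Small epsilon guards against float error, e.g. 0.3 * 10 = 2.9999...
    n_visible = int(exposure * len(reasoning_steps) + 1e-9)
    return list(reasoning_steps[:n_visible])


def build_teacher_prompt(question: str, reasoning_steps: Sequence[str], exposure: float) -> str:
    """Assemble a teacher context: the question plus the exposed reasoning prefix."""
    parts = [f"Question: {question}"]
    visible = expose_reference(reasoning_steps, exposure)
    if visible:
        parts.append("Reference reasoning (partial):")
        parts.extend(visible)
    return "\n".join(parts)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumption: used here as a stand-in mismatch measure with p = teacher,
    q = student; the paper's actual metric may differ.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    steps = ["Step 1: set up the equation.", "Step 2: isolate x.",
             "Step 3: substitute.", "Step 4: verify the answer."]
    # A fixed-exposure sweep would repeat training or evaluation at each level.
    for level in (0.0, 0.5, 1.0):
        print(f"--- exposure = {level} ---")
        print(build_teacher_prompt("Solve 2x + 3 = 11.", steps, level))
```

In a real sweep, the teacher and student distributions would come from model forward passes on the student's own rollouts, and the mismatch would be averaged over tokens at each exposure level. A learnable variant would replace the fixed `exposure` argument with a value chosen during training.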
