Constitutional On-Policy Safe Distillation (Article)

arXiv cs.AI · 2026-06-03 · NEWS · en
Authors: Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng, Zhuoer Xu, Yuhao Sun, Shiwen Cui, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang

Abstract

arXiv:2606.03089v1 · Announce Type: cross

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm in which a teacher conditioned on privileged information provides dense token-level supervision. Prior work has shown that OPSD can collapse on verifiable reasoning tasks. Safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, which makes it a natural setting in which to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and the reverse KL objective further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension.
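
The contraction described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method: the four-token vocabulary, the two probability vectors, and the helper names `kl` and `entropy` are all hypothetical, chosen only to show how a reverse KL term, KL(student || teacher), penalizes a broad student when the constitution-conditioned teacher has concentrated its mass on one conservative token.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical next-token distributions over a 4-token toy vocabulary.
# Index 0 stands in for a short, conservative continuation (assumption).
student = [0.25, 0.25, 0.25, 0.25]   # broad, expressive student
teacher = [0.85, 0.05, 0.05, 0.05]   # contracted teacher after conditioning

# Reverse KL: expectation under the student, as in on-policy distillation.
reverse_kl = kl(student, teacher)
# Forward KL: expectation under the teacher, shown for comparison only.
forward_kl = kl(teacher, student)

print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"entropy: student = {entropy(student):.4f}, teacher = {entropy(teacher):.4f}")
```

In this toy setting the reverse KL charges the student for every token it keeps probable that the teacher has made unlikely, so minimizing it pulls the student toward the teacher's lower-entropy distribution. How this plays out over full sequences, and how the paper formalizes it as geometric leakage, is not specified in the abstract.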

Related events (1)

Constitutional On-Policy Safe Distillation
2026-06-03 · PRODUCT_LAUNCH · Impact: MEDIUM
