SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment (Event)

PRODUCT_LAUNCH 2026-06-02 Impact: MEDIUM

arXiv:2606.02530v1 Announce Type: cross Abstract: Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, a cost termed the alignment tax. Existing methods mitigate this by balancing dual objectives, but they rely heavily on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution,