SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
Event: PRODUCT_LAUNCH, 2026-06-02, Impact: MEDIUM
arXiv:2606.02530v1 Announce Type: cross
Abstract: Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution,
Source: ArXiv CS.CL, 2026-06-02