Configurable Reward Model for Balanced Safety Alignment · Event
BREAKTHROUGH · 2026-06-01 · Impact: HIGH
arXiv:2605.30487v1 · Announce Type: new
Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications. We introduce the Configurable Safety Reward
Configurable Reward Model for Balanced Safety Alignment · Related coverage
Configurable Reward Model for Balanced Safety Alignment
ArXiv CS.CL · 2026-06-01