When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models · Event

PRODUCT_LAUNCH · 2026-05-28 · Impact: MEDIUM

arXiv:2605.27851v1. Abstract: Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls
