When Behavioral Safety Evaluation Fails: A Representation-Level Perspective (Event)

PRODUCT_LAUNCH | 2026-06-09 | Impact: MEDIUM

arXiv:2606.08044v1. Announce Type: cross. Abstract: Large language model (LLM) safety is often evaluated at the behavior level. Because such evaluations target outputs rather than representation-level vulnerability under intervention, they provide limited evidence of internal robustness. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study