When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
Event: PRODUCT_LAUNCH | 2026-06-09 | Impact: MEDIUM
arXiv:2606.08044v1 | Announce Type: cross
Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study
Source: ArXiv CS.AI, 2026-06-09