Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

ArXiv CS.AI | 2026-06-06 | NEWS | en | Authors: Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang

Details

Source site: ArXiv CS.AI
Authors: Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang
Article type: NEWS
Language: en
Publication date: 2026-06-06

Abstract

arXiv:2606.05614v1 (announce type: new)

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability.
