详细信息
- 来源站点
- ArXiv CS.AI
- 作者
- Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang
- 文章类型
- NEWS
- 语言
- en
- 发布日期
- 2026-06-06
摘要
arXiv:2606.05614v1 Announce Type: new Abstract: Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability.
相关事件
暂无数据
相关公司
暂无数据
相关人物
暂无数据