摘要
arXiv:2606.05523v1 Announce Type: new Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses either rely on non-scalable human curation or white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve.
相关事件查看全部 (1)
CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
2026-06-05PRODUCT_LAUNCH影响: MEDIUM
相关公司
暂无数据
相关人物
暂无数据
相关产品
暂无数据