REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak (Event)

PRODUCT_LAUNCH | 2026-06-04 | Impact: MEDIUM

arXiv:2605.20654v2 Announce Type: replace-cross

Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflec