Prompt Injection as Role Confusion

arXiv cs.CL | 2026-06-01 | Authors: Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

Abstract

arXiv:2603.12277v5 (announce type: replace)

LLMs see the world as a single stream of text, partitioned into labeled roles. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not from its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like trusted text, despite its label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated.
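The abstract does not say how the role probes are built. As a rough illustration of the general idea only, the sketch below fits a plain logistic-regression probe on synthetic "hidden states" and then scores a state that is labeled as tool output but sounds like a trusted role. Everything here is an assumption made for illustration: the 4-dimensional vectors, the premise that one axis encodes how the text sounds, and the names `train_probe`, `probe_score`, `synthetic_state`, and `demo`. It is not the authors' probe and it uses no real model activations.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent.

    labels: 1 = trusted role, 0 = untrusted role (hypothetical labeling).
    """
    dim = len(states[0])
    n = len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, state):
    # Probability the probe assigns to "this text comes from the trusted role".
    return sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)


def synthetic_state(rng, style, dim=4, noise=0.3):
    # ASSUMPTION: axis 0 encodes how the text sounds (+1 trusted, -1 untrusted);
    # the remaining axes are noise. Real hidden states are not this simple.
    return [style + rng.gauss(0, noise)] + [rng.gauss(0, noise) for _ in range(dim - 1)]


def demo(seed=0):
    rng = random.Random(seed)
    trusted = [synthetic_state(rng, +1.0) for _ in range(40)]
    untrusted = [synthetic_state(rng, -1.0) for _ in range(40)]
    w, b = train_probe(trusted + untrusted, [1] * 40 + [0] * 40)

    benign_tool = [-1.0, 0.0, 0.0, 0.0]    # labeled tool output, sounds like tool output
    injected_tool = [1.0, 0.0, 0.0, 0.0]   # labeled tool output, sounds like the trusted role
    return probe_score(w, b, benign_tool), probe_score(w, b, injected_tool)


if __name__ == "__main__":
    benign, injected = demo()
    print(f"benign tool text   -> P(trusted) = {benign:.2f}")
    print(f"injected tool text -> P(trusted) = {injected:.2f}")
```

In this toy setup the probe never sees the role label at scoring time, so a state that sounds trusted lands on the trusted side regardless of where it was placed. That mirrors the abstract's claim in spirit only; whether real activations behave this way is what the paper's probes are meant to measure.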
