The Attentional White Bear Effect in Transformer Language Models

arXiv cs.CL | 2026-05-28 | Authors: Rebecca Ramnauth, Brian Scassellati

Abstract

arXiv:2605.28639v1 (announce type: new)

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.
