Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Event: PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM

arXiv:2605.27997v1 | Announce Type: new

Abstract: Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and ne