Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

ArXiv CS.CL | 2026-05-28 | Authors: Himanshu Beniwal, Mayank Singh

Abstract

arXiv:2605.27997v1 (announce type: new)

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering and offer no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits, without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality.
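To make the three ideas named in the abstract concrete (activation differentials, inference-time scaling, and rank-one weight edits), here is a minimal NumPy sketch on synthetic data. It is not the Meow2X or TRNE implementation, whose details the abstract does not give. The function names, the mean-difference ranking, the scaling factor `alpha`, and the projection-style rank-one update are all assumptions chosen for illustration.

```python
import numpy as np

def top_differential_neurons(toxic_acts, neutral_acts, k):
    """Rank neurons by the gap in mean activation (toxic minus neutral).

    Both inputs have shape (num_prompts, num_neurons). Mean difference is an
    assumed scoring rule, not necessarily the one used in the paper.
    """
    diff = toxic_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    ids = np.argsort(-np.abs(diff))[:k]
    return ids, diff

def scale_neurons(acts, neuron_ids, alpha):
    """Inference-time suppression: multiply the selected neurons by alpha."""
    out = acts.copy()
    out[..., neuron_ids] *= alpha
    return out

def rank_one_suppress(W, direction):
    """Rank-one edit W' = W - u u^T W, with u the unit-normalized direction.

    W has shape (d_out, d_in). After the edit, W' @ x has no component along
    u for any input x. This is a hypothetical form of a rank-one edit.
    """
    u = direction / np.linalg.norm(direction)
    return W - np.outer(u, u @ W)

# Synthetic demo: neuron 3 is made to fire more strongly on "toxic" prompts.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(64, 8))
toxic = rng.normal(size=(64, 8))
toxic[:, 3] += 5.0

ids, diff = top_differential_neurons(toxic, neutral, k=1)
damped = scale_neurons(toxic, ids, alpha=0.1)

W = rng.normal(size=(8, 5))        # stand-in for a layer's output weights
W_edited = rank_one_suppress(W, diff)
```

Neither operation in this sketch uses gradients: scaling changes activations at inference time, while the rank-one edit modifies the weights once in closed form.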
