Intrinsically Interpretable Attention via Sparse Post-Training 文章

ArXiv CS.AI2026-05-26NEWSen作者: Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Sch\"olkopf

摘要

arXiv:2512.05865v5 Announce Type: replace-cross Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them.