Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability 文章

ArXiv CS.AI2026-06-06NEWSen作者: Seyed Arshan Dalili, Mehrdad Mahdavi

摘要

arXiv:2606.06333v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime.

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (2)