ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

arXiv cs.AI, 2026-05-28. Authors: Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

Abstract

arXiv:2605.27819v1, announce type: cross.

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.
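The residualization step described in the abstract can be illustrated with a minimal sketch. It assumes two selected layers, synthetic activations, and an ordinary least-squares fit for the affine map; the abstract does not say how the map is fitted, so that choice is an assumption. The helper names `fit_affine`, `residualize`, and `map_back` are hypothetical, and the SAE itself is left out: the exact residual stands in for what a later-layer SAE would reconstruct.

```python
import numpy as np

def fit_affine(x_prev, x_next):
    """Least-squares fit of x_next ~ x_prev @ W + b (assumed fitting method)."""
    ones = np.ones((x_prev.shape[0], 1))
    design = np.hstack([x_prev, ones])          # append a bias column
    coef, *_ = np.linalg.lstsq(design, x_next, rcond=None)
    return coef[:-1], coef[-1]                  # W: (d, d), b: (d,)

def residualize(x_prev, x_next, W, b):
    """Part of the later layer that the affine map does not explain."""
    return x_next - (x_prev @ W + b)

def map_back(x_prev_hat, resid_hat, W, b):
    """Return to the original activation space of the later layer."""
    return x_prev_hat @ W + b + resid_hat

# Synthetic stand-ins for residual-stream activations at two layers.
rng = np.random.default_rng(0)
n, d = 512, 8
x_a = rng.normal(size=(n, d))
x_b = x_a @ rng.normal(size=(d, d)) + 0.5 + 0.1 * rng.normal(size=(n, d))

W, b = fit_affine(x_a, x_b)
resid = residualize(x_a, x_b, W, b)   # a later-layer SAE would be trained on this

# Placeholder: the exact residual is used where an SAE reconstruction would go.
x_b_hat = map_back(x_a, resid, W, b)
```

In this toy setup most of the variance in `x_b` is carried forward from `x_a`, so the residual is much smaller than the full activation. That is the capacity argument from the abstract in miniature: a dictionary trained on `resid` does not need to re-encode information already present in the earlier layer.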
