PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding 文章

ArXiv CS.CL2026-05-26NEWSen作者: Panagiotis Koromilas, Andreas D. Demou, James Oldfield, Yannis Panagakis, Mihalis Nicolaou

查看原文 →

关系图谱

摘要

arXiv:2602.01322v2 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) interpret neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether ''Starbucks'' arises from the composition of ''star'' and ''coffee'' features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2).

PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (3)

相关技术查看全部 (4)