Interpretability Transfer from Language to Vision via Sparse Autoencoders

arXiv cs.CV, 2026-05-26
Authors: Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri

Abstract

arXiv:2605.24946v1 (announce type: new)

Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to translate effectively to the visual domain, mainly because labeling visual concepts is difficult and ambiguous. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image.
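
The regularization idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the SAE architecture, the point at which the SAE is applied to visual tokens, or how the regularizer is weighted. The sketch assumes a frozen ReLU autoencoder, a mean-squared reconstruction error per visual token, and a hypothetical weight `lam` that adds this penalty to whatever task loss trains the projector. All function and parameter names are hypothetical.

```python
# Illustrative sketch only: the names, the ReLU SAE form, and the weight `lam`
# are assumptions, not details taken from the paper.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Sparse code z = ReLU(W_enc x + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]

def sae_decode(z, W_dec, b_dec):
    """Reconstruction x_hat = W_dec z + b_dec."""
    return [a + b for a, b in zip(matvec(W_dec, z), b_dec)]

def sae_reconstruction_loss(x, sae):
    """Mean squared error between x and its reconstruction by a frozen SAE."""
    z = sae_encode(x, sae["W_enc"], sae["b_enc"])
    x_hat = sae_decode(z, sae["W_dec"], sae["b_dec"])
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def projector_loss(task_loss, visual_tokens, sae, lam=0.1):
    """Task loss plus the SAE reconstruction penalty averaged over visual tokens.

    A token that the textual SAE reconstructs poorly lies outside the space
    the SAE was trained on, so the penalty pushes the projector toward it.
    """
    recon = sum(sae_reconstruction_loss(t, sae) for t in visual_tokens)
    return task_loss + lam * recon / len(visual_tokens)

# Toy usage: a 2-dimensional identity SAE with zero biases.
toy_sae = {
    "W_enc": [[1.0, 0.0], [0.0, 1.0]],
    "b_enc": [0.0, 0.0],
    "W_dec": [[1.0, 0.0], [0.0, 1.0]],
    "b_dec": [0.0, 0.0],
}
toy_tokens = [[1.0, 2.0], [-1.0, 2.0]]  # second token has a negative entry
total = projector_loss(1.0, toy_tokens, toy_sae, lam=0.1)
```

In this toy case the first token is reconstructed exactly, while the negative entry of the second token is clipped by the ReLU and so contributes a nonzero penalty. In a real training setup the same quantity would be computed with an autodiff framework so that gradients reach the projector while the SAE stays frozen.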
