Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

arXiv cs.CL · 2026-06-03 · Authors: Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena

Related techniques

chain-of-thought, multi-hop reasoning, bridge entities, scalar affine adapter, interpretability artifacts, lightweight adapters, sparse autoencoder, self-interpretation methods