Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation

ArXiv CS.CL | 2026-06-02 | NEWS | en | Authors: Joakim Edin, Casper L. Christensen, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe


Details

Source site: ArXiv CS.CL
Authors: Joakim Edin, Casper L. Christensen, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe
Article type: NEWS
Language: en
Publication date: 2026-06-02

Original text

Abstract

arXiv:2505.17630v4 (announce type: replace)

Circuit localization methods aim to identify the subset of model components responsible for specific behaviors in large language models, enabling detailed mechanistic analysis. Most existing methods assume components act independently and estimate importance by perturbing each component in isolation. However, components in neural networks interact, and ignoring these interactions leads to systematic misestimation of component importance. We find that one particularly problematic interaction is attention self-repair, in which softmax redistribution causes gradients for influential attention scores to vanish as other positions with similar values compensate. We introduce Gradient Interaction Modifications (GIM), a technique that explicitly accounts for feature interactions during backpropagation.
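The attention self-repair effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the GIM method and is not taken from the paper; the scores, the scalar values, and all function names are assumptions chosen for illustration. It uses a single attention query over three positions, where two positions carry the same value and receive the same high score. Masking either one alone barely changes the output, and the gradient with respect to its score is close to zero, yet masking both removes almost all of the output.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(scores, values):
    # Single-query attention with scalar values: sum_i p_i * v_i.
    probs = softmax(scores)
    return sum(p * v for p, v in zip(probs, values))

def score_gradients(scores, values):
    # Analytic gradient of the output w.r.t. each pre-softmax score:
    # d(out)/d(s_j) = p_j * (v_j - out).
    probs = softmax(scores)
    out = sum(p * v for p, v in zip(probs, values))
    return [p * (v - out) for p, v in zip(probs, values)]

def ablate(scores, positions):
    # Mask the given positions by setting their scores to -inf.
    return [-math.inf if i in positions else s for i, s in enumerate(scores)]

# Hypothetical toy setup: positions 0 and 1 are redundant (same score, same value).
scores = [5.0, 5.0, 0.0]
values = [1.0, 1.0, 0.0]

base = attention_output(scores, values)
grads = score_gradients(scores, values)
drop_one = base - attention_output(ablate(scores, {0}), values)
drop_both = base - attention_output(ablate(scores, {0, 1}), values)

print("baseline output:", base)
print("gradient w.r.t. score 0:", grads[0])
print("output change, position 0 masked:", drop_one)
print("output change, positions 0 and 1 masked:", drop_both)
```

In this toy setting, both the gradient and the single-position ablation suggest that position 0 is unimportant, because softmax shifts its probability mass to position 1, which carries the same value. Only the joint ablation reveals that the pair is responsible for nearly all of the output, which is the kind of misestimation the abstract attributes to treating components independently.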
