Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

ArXiv CS.AI | 2026-06-09 | NEWS | en | Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan

Details

Source: ArXiv CS.AI
Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan
Article type: NEWS
Language: en
Published: 2026-06-09

Abstract

arXiv:2601.21996v2 (announce type: replace-cross)

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability.
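
For readers unfamiliar with influence functions, the following is a minimal sketch of the classic formulation on a toy ridge regression, not the MDA procedure itself (the abstract does not describe how the method is scaled to LLMs). It scores each training sample by how much upweighting it would shift a scalar function of the parameters, using score_i = -∇f(θ)ᵀ H⁻¹ ∇L_i(θ). The helpers `fit_ridge` and `influence_scores`, the synthetic data, and the choice f(θ) = θ[0] as a stand-in for an "interpretable unit" score are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam, weights=None):
    """Minimize sum_i w_i * 0.5*(x_i @ theta - y_i)^2 + 0.5*lam*||theta||^2.

    Hypothetical helper; weights default to 1/n (the mean loss).
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n) if weights is None else weights
    A = X.T @ (w[:, None] * X) + lam * np.eye(d)
    b = X.T @ (w * y)
    return np.linalg.solve(A, b)

def influence_scores(X, y, theta, lam, grad_f):
    """Influence of upweighting each training sample on a scalar f(theta).

    Implements score_i = -grad_f^T H^{-1} grad L_i(theta), where H is the
    Hessian of the regularized mean training loss.
    """
    n, d = X.shape
    residuals = X @ theta - y
    per_sample_grads = residuals[:, None] * X       # shape (n, d)
    H = X.T @ X / n + lam * np.eye(d)               # exact Hessian for ridge
    h_inv_grad_f = np.linalg.solve(H, grad_f)       # H is symmetric
    return -per_sample_grads @ h_inv_grad_f         # shape (n,)

# Synthetic data (assumption: stands in for a training corpus).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

theta = fit_ridge(X, y, lam)
grad_f = np.array([1.0, 0.0, 0.0])                  # f(theta) = theta[0]
scores = influence_scores(X, y, theta, lam, grad_f)

# Indices of the five samples with the largest absolute influence on f.
top = np.argsort(-np.abs(scores))[:5]
print(top)
```

In this toy setting the scores can be checked directly by refitting with one sample slightly upweighted. At LLM scale the Hessian cannot be formed or inverted explicitly, so any practical method has to rely on approximations that this sketch does not cover.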
