Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

ArXiv CS.AI | 2026-06-09 | NEWS | en | Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan


Abstract

arXiv:2601.21996v2 (announce type: replace-cross)

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability.
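
The abstract does not describe how MDA computes influence, so the following is only a minimal sketch of the classical influence-function estimate, I(z_i) = -∇f(θ)ᵀ H⁻¹ ∇L(z_i, θ), on a toy logistic-regression model. It is not the paper's implementation. The "unit" being traced is a hypothetical stand-in (the first model weight), and the names `influence_scores`, `damping`, and `top_k` are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def influence_scores(X, y, theta, grad_f, damping=1e-3):
    """Influence of each training sample on a scalar target f(theta).

    Classical estimate: I(z_i) = -grad_f^T H^{-1} grad L(z_i, theta),
    where H is the damped Hessian of the mean training loss.
    A positive score means up-weighting the sample would increase f.
    """
    p = sigmoid(X @ theta)                        # predicted probabilities, shape (n,)
    per_sample_grads = (p - y)[:, None] * X       # logistic-loss gradient per sample, (n, d)
    weights = p * (1.0 - p)
    H = (X * weights[:, None]).T @ X / len(X)     # Hessian of the mean loss, (d, d)
    H += damping * np.eye(X.shape[1])             # damping keeps the solve well-conditioned
    h_inv_grad_f = np.linalg.solve(H, grad_f)     # H^{-1} grad_f without forming the inverse
    return -per_sample_grads @ h_inv_grad_f


# Toy data: 200 samples, 3 features, noisy binary labels (all values are made up).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.5, -1.0, 0.5])
y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)

# Fit the toy model with plain gradient descent.
theta = np.zeros(d)
for _ in range(500):
    theta -= 0.5 * X.T @ (sigmoid(X @ theta) - y) / n

# Hypothetical "unit" metric: f(theta) = theta[0], so grad_f is the first basis vector.
grad_f = np.zeros(d)
grad_f[0] = 1.0

scores = influence_scores(X, y, theta, grad_f)
top_k = np.argsort(-scores)[:5]   # indices of the 5 highest-influence samples
print("highest-influence sample indices:", top_k)
```

In this sketch, the samples in `top_k` are the candidates one would remove or duplicate in a targeted intervention, with a random subset of the same size serving as the control. At LLM scale the explicit Hessian solve used here would be infeasible, so some approximation would be needed; which one the authors use is not stated in the abstract.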
