Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units (Event)

PRODUCT_LAUNCH · 2026-06-09 · Impact: MEDIUM

arXiv:2601.21996v2. Announce Type: replace-cross.
Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family,
