Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability

arXiv cs.AI | 2026-05-26 | Authors: David N. Olivieri, Antonio F. Pérez Rodríguez

Abstract

arXiv:2605.25225v1 (announce type: cross)

Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful directions in Transformer activation space. This paper develops a field-theoretic framework for organizing and predicting such interventions. Treating the residual stream as a depth-token field, we formulate patching as localized source insertion, patch effects as sensitivity-field predictions, downstream propagation as empirical Green-function response, and patch selection as an adjoint variational problem. Empirically, we test the forward response theory in GPT-2-style autoregressive Transformers by applying localized residual-field interventions and observing the induced residual-field differences and logit-difference responses. We identify a bounded local linear regime; predict patch effects from first-order sensitivities across residual sites;
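
The idea of predicting a patch effect from first-order sensitivities can be illustrated on a toy model. The sketch below is not the paper's code or experimental setup. It assumes a small residual stack `h_{l+1} = h_l + tanh(W_l h_l)` with a scalar linear readout standing in for a logit difference, and every name in it (`forward`, `sensitivities`, `patch_effect`, `predicted_effect`, the random weights) is hypothetical. A localized source `eps * direction` is added to the residual state at one depth, and the measured change in the readout is compared with the linear prediction `eps * (g[site] @ direction)`, where `g[site]` is the readout gradient at that site obtained from an adjoint (backward) pass.

```python
import numpy as np

def forward(h0, weights, source=None, site=None):
    """Run the toy residual stack h <- h + tanh(W h).

    If `source` is given, it is added to the state entering layer `site`
    (a localized source insertion). Returns the final state and the list
    of states entering each layer.
    """
    h = h0.copy()
    states = []
    for l, W in enumerate(weights):
        if source is not None and l == site:
            h = h + source
        states.append(h)
        h = h + np.tanh(W @ h)
    return h, states

def sensitivities(states, weights, u):
    """Adjoint pass: g[l] is the gradient of the readout u @ h_final
    with respect to the state entering layer l."""
    g = u.copy()
    out = [None] * len(weights)
    for l in reversed(range(len(weights))):
        W = weights[l]
        pre = W @ states[l]
        # Transposed Jacobian of h + tanh(W h), applied to g.
        g = g + W.T @ (g * (1.0 - np.tanh(pre) ** 2))
        out[l] = g
    return out

# Toy setup (all values are arbitrary assumptions for illustration).
rng = np.random.default_rng(0)
d, depth = 8, 4
weights = [0.3 * rng.standard_normal((d, d)) for _ in range(depth)]
u = rng.standard_normal(d)    # linear readout, stand-in for a logit difference
h0 = rng.standard_normal(d)   # clean input state

h_clean, states = forward(h0, weights)
g = sensitivities(states, weights, u)

site = 1
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

def patch_effect(eps):
    """Measured readout change after inserting eps * direction at `site`."""
    h_patched, _ = forward(h0, weights, source=eps * direction, site=site)
    return u @ h_patched - u @ h_clean

def predicted_effect(eps):
    """First-order prediction from the sensitivity at `site`."""
    return eps * (g[site] @ direction)

if __name__ == "__main__":
    for eps in (1e-3, 1e-2, 1e-1, 1.0):
        print(eps, patch_effect(eps), predicted_effect(eps))
```

Sweeping `eps` as in the final loop gives a rough view of where the linear prediction stops tracking the measured effect in this toy model. That is only an analogy for a bounded local linear regime and says nothing about where the boundary lies in GPT-2-style models.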