Query Circuits: Explaining How Language Models Answer User Prompts

ArXiv CS.AI | 2026-06-02 | Authors: Tung-Yu Wu, Fazl Barez

Abstract

arXiv:2509.24808v2

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but they do not explain why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside the model that maps a specific input to its output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, which yields more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric that evaluates how well a discovered circuit recovers the model's decision for a specific input and that is broadly applicable to circuit discovery beyond our setting.
