Tool Calling is Linearly Readable and Steerable in Language Models

arXiv cs.CL, 2026-05-27
Authors: Zekun Wu (University College London), Ze Wang (University College London), Seonglae Cho (Holistic AI), Yufei Yang (Imperial College London), Adriano Koshiyama (University College London), Sahan Bulathwela (University College London), Maria Perez-Ortiz (University College London)

Abstract

arXiv:2605.07990v2

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $\tau$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough.
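To make the idea of "one direction per pair of tools" concrete, here is a minimal sketch on synthetic vectors. It is not the authors' code, and the abstract does not say how the direction is estimated. The sketch assumes a difference-of-means direction between activations recorded for two tools. The toy 8-dimensional "activations", the cluster centers, and the helper names (`fit_direction`, `reads_as_b`, `steer`) are all hypothetical. In a real model the vectors would be hidden states captured at some layer, and the steering vector would be added to those states during generation.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample(center, n, rng, noise=0.3):
    """Toy stand-in for hidden states: Gaussian noise around a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


def fit_direction(acts_a, acts_b):
    """Assumed estimator: difference of class means, plus the midpoint
    between the two means to use as a decision threshold."""
    mu_a, mu_b = mean_vector(acts_a), mean_vector(acts_b)
    direction = [b - a for a, b in zip(mu_a, mu_b)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_a, mu_b)]
    return direction, midpoint


def reads_as_b(x, direction, midpoint):
    """Linear readout: project onto the direction, compare to the midpoint."""
    centered = [xi - mi for xi, mi in zip(x, midpoint)]
    return dot(centered, direction) > 0


def steer(x, direction, alpha=1.0):
    """Steering: add a scaled copy of the direction to the activation."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
dim = 8
center_a = [2.0] + [0.0] * (dim - 1)        # hypothetical cluster for tool A
center_b = [0.0, 2.0] + [0.0] * (dim - 2)   # hypothetical cluster for tool B
acts_a = sample(center_a, 200, rng)
acts_b = sample(center_b, 200, rng)

direction, midpoint = fit_direction(acts_a, acts_b)

# Readout accuracy, measured on the same samples used to fit the direction.
correct = sum(not reads_as_b(x, direction, midpoint) for x in acts_a)
correct += sum(reads_as_b(x, direction, midpoint) for x in acts_b)
readout_accuracy = correct / (len(acts_a) + len(acts_b))

# Fraction of tool-A activations that read as tool B after steering.
flipped = sum(reads_as_b(steer(x, direction), direction, midpoint) for x in acts_a)
flip_rate = flipped / len(acts_a)

print(f"readout accuracy: {readout_accuracy:.2f}, flip rate: {flip_rate:.2f}")
```

The two clusters here are well separated by construction, so the numbers the script prints say nothing about real models; the percentages reported in the abstract come from the paper's own benchmarks.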