The Shape of Wisdom: Decision Trajectories in Language Models

ArXiv CS.CL, 2026-06-02. Author: Shailesh Rana

Abstract

arXiv:2606.01202v1 (Announce Type: cross)

Language models do not simply choose an answer at the output layer. In a study of 9,000 MMLU trajectories across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the answer's score shifts across layer depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical finding is that correctness and stability are distinct properties: the largest group of trajectories is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points toward the correct answer, while the average MLP scalar does not; span deletion shows that removing answer-supporting text lowers the margin and removing distractor-like text raises it. The result is not a full circuit explanation.
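The three per-layer quantities and the stable/unstable split can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the author's code: it assumes per-layer scores for each MMLU option are already available (for example, read out at every layer with a logit-lens style projection), defines the margin as the correct option's score minus the strongest competing option, reads "distance from a decision flip" as the number of layers until the margin's sign next changes, and calls a trajectory stable if its margin keeps its final sign over the last `tail` layers. The names `margins`, `trajectory_features`, `classify`, `LayerFeatures`, and the `tail` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class LayerFeatures:
    margin: float                  # correct-option score minus best rival score
    delta: Optional[float]         # next-layer change in margin (None at last layer)
    layers_to_flip: Optional[int]  # layers until the margin sign changes (None if never)


def margins(scores: Sequence[Sequence[float]], correct: int) -> List[float]:
    """Per-layer margin: correct option's score minus the best competing option.

    `scores[l][i]` is the (assumed) score of option i read out at layer l.
    """
    out = []
    for layer in scores:
        rival = max(s for i, s in enumerate(layer) if i != correct)
        out.append(layer[correct] - rival)
    return out


def trajectory_features(scores: Sequence[Sequence[float]], correct: int) -> List[LayerFeatures]:
    """Margin, next-layer change, and layers-until-flip for every layer."""
    m = margins(scores, correct)
    feats = []
    for l, ml in enumerate(m):
        delta = m[l + 1] - ml if l + 1 < len(m) else None
        flip = None
        for k in range(l + 1, len(m)):
            # A flip means the decision switches between correct (> 0) and not (<= 0).
            if (m[k] > 0) != (ml > 0):
                flip = k - l
                break
        feats.append(LayerFeatures(ml, delta, flip))
    return feats


def classify(scores: Sequence[Sequence[float]], correct: int, tail: int = 4) -> str:
    """Label a trajectory as {stable,unstable}-{correct,incorrect}.

    Hypothetical rule: correct if the final margin is positive; stable if the
    margin has the same sign as the final one over the last `tail` layers.
    """
    m = margins(scores, correct)
    correct_final = m[-1] > 0
    stable = all((x > 0) == correct_final for x in m[-tail:])
    return f"{'stable' if stable else 'unstable'}-{'correct' if correct_final else 'incorrect'}"


# Toy 4-layer trajectory over options A-D; option A (index 0) is correct.
scores = [[1, 5, 2, 0], [6, 4, 1, 0], [3, 5, 2, 1], [9, 2, 1, 0]]
print(margins(scores, 0))       # [-4, 2, -2, 7]
print(classify(scores, 0))      # unstable-correct
```

On this toy trajectory the answer ends correct but its margin flips sign at every layer, so it counts as unstable-correct under the default `tail=4`, and as stable-correct if only the final layer is checked (`tail=1`). The choice of `tail` is the main knob: it decides how late a flip can occur before a correct answer stops counting as stable.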
