Multimodal Function Vectors for Visual Relations 文章

ArXiv CS.AI2026-06-02NEWSen作者: Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

摘要

arXiv:2510.02528v2 Announce Type: replace Abstract: Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First, using synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time.

Multimodal Function Vectors for Visual Relations 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (5)