Attention Alignment Between Humans and Vision-Language Models 文章

ArXiv CS.CV2026-06-17NEWSen作者: Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase, Taylor Webb, Michael Graziano

详细信息

来源站点
ArXiv CS.CV
作者
Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase, Taylor Webb, Michael Graziano
文章类型
NEWS
语言
en
发布日期
2026-06-17

摘要

arXiv:2606.17410v1 Announce Type: new Abstract: Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2$\times$2 factorial of CNN vs.\ ViT encoders crossed with LSTM vs.\ Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs.\ Transformer decoders increased alignment by 40--50 percentage points (80--87\% vs.\ 40--59\% of the human noise ceiling). In contrast, CNN vs.\ ViT encoders contributed a secondary 5--20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85--87\%).

相关事件

暂无数据

相关公司

暂无数据

相关人物

暂无数据