Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

arXiv cs.CL, 2026-05-27. Authors: Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi

Abstract

arXiv:2605.27315v1

Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties.
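As a rough illustration of what "alignment with human ratings" could mean in practice, the sketch below compares model scores to human ratings with a Spearman rank correlation. The abstract does not name the alignment metric, so the choice of Spearman is an assumption, and the words, ratings, and model scores are invented toy values, not data or results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values (assumptions for illustration only).
words = ["idea", "justice", "journey", "apple", "hammer"]
human = [1.2, 2.0, 3.5, 4.8, 4.9]        # made-up human concreteness ratings
text_only = [1.5, 2.2, 3.0, 4.5, 4.7]    # made-up model scores, no image
with_image = [3.9, 2.1, 3.2, 4.4, 4.6]   # made-up model scores, image context

print("text-only alignment: ", spearman(human, text_only))
print("with-image alignment:", spearman(human, with_image))
```

In this toy setup the image-conditioned score for the abstract word "idea" is inflated, which lowers the rank correlation with the human ratings. That mirrors the kind of effect the abstract describes for low-imagery words, but only as a constructed example.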
