Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Source: ArXiv CS.CV. Published: 2026-05-28. Type: NEWS. Language: en.
Authors: Chufan Shi, Cheng Yang, Yaokang Wu, Linghao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma



Abstract

arXiv:2605.15864v2

Vision-Language Models (VLMs) often produce self-reflective statements such as "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this with VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace the image with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, a set of 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation.
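The image-swap idea described in the abstract can be illustrated with a small toy harness. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the actual prompts, scoring rules, or model interface, so `SwapPair`, `swap_probe`, the three-argument model callable, and both toy models are hypothetical names introduced here for illustration. The harness asks a model to answer over the original image, then re-queries it with its own earlier output as context and the image swapped, and compares accuracy before and after.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical model interface (an assumption, not from the paper):
# (image_id, question, prior_reasoning) -> answer string.
Model = Callable[[str, str, str], str]


@dataclass
class SwapPair:
    question: str
    original_image: str   # image shown while the model first reasons
    swapped_image: str    # visually similar, semantically different image
    original_answer: str  # correct answer for the original image
    swapped_answer: str   # correct answer for the swapped image


def swap_probe(model: Model, pairs: Sequence[SwapPair]) -> Dict[str, float]:
    """Measure accuracy before and after the image is swapped mid-reasoning."""
    if not pairs:
        raise ValueError("need at least one image pair")
    base_correct = 0
    swap_correct = 0
    for p in pairs:
        # Phase 1: answer over the original image with no prior context.
        first = model(p.original_image, p.question, "")
        base_correct += int(first == p.original_answer)
        # Phase 2: keep the earlier output as context, but swap the image.
        second = model(p.swapped_image, p.question, first)
        swap_correct += int(second == p.swapped_answer)
    base_acc = base_correct / len(pairs)
    swap_acc = swap_correct / len(pairs)
    return {
        "base_accuracy": base_acc,
        "swap_accuracy": swap_acc,
        "accuracy_drop": base_acc - swap_acc,
    }


# Toy stand-ins for real VLMs: each image id maps to the answer it depicts.
TOY_ANSWERS = {"chart_a.png": "3", "chart_b.png": "5"}


def seeing_model(image: str, question: str, prior: str) -> str:
    # Always reads the answer off the image currently in context.
    return TOY_ANSWERS[image]


def saying_model(image: str, question: str, prior: str) -> str:
    # Repeats its earlier output whenever one exists, ignoring the new image.
    return prior if prior else TOY_ANSWERS[image]


if __name__ == "__main__":
    pairs = [SwapPair("How many bars?", "chart_a.png", "chart_b.png", "3", "5")]
    print(swap_probe(seeing_model, pairs))
    print(swap_probe(saying_model, pairs))
```

In this toy setup, a model that really re-reads the image shows no accuracy drop after the swap, while a model that only repeats its earlier text loses all of its accuracy. A real evaluation would replace the toy callables with actual VLM inference and a more forgiving answer-matching rule.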
