Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models 文章

ArXiv CS.CV2026-06-03NEWSen作者: S Divakar Bhat, Toshihiko Yamasaki

摘要

arXiv:2606.02742v1 Announce Type: new Abstract: Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce \textbf{ViewDiag}, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2--10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse.

相关事件

暂无数据

相关公司

暂无数据

相关人物

暂无数据