Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

ArXiv CS.CV, 2026-06-01
Authors: Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

Abstract

arXiv:2605.30557v1

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues.
