Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

ArXiv cs.CV | 2026-06-03 | Authors: Dongsheng Wang, Dawei Su, Hui Huang

Abstract

arXiv:2606.03100v1 (announce type: new). Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has attracted growing research interest because of these models' promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into a pre-trained VLM to answer a given question. This paradigm makes the quality of the input context critical and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose KeyVT, a hierarchical approach that collects input context at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, yielding spatially consistent and task-relevant views.
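The abstract does not spell out how view importance is computed, so the following is only an illustrative sketch of the general idea of scoring views on both semantic content and geometric position under a fixed budget. It is not KeyVT's algorithm. The function name `select_views`, the weight `alpha`, the use of cosine similarity for relevance, and the nearest-camera distance for geometric novelty are all assumptions introduced here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def select_views(view_feats, cam_positions, question_feat, budget, alpha=0.5):
    """Greedily pick up to `budget` view indices (hypothetical helper).

    Each candidate is scored as
        alpha * semantic relevance + (1 - alpha) * geometric novelty,
    where relevance is the cosine similarity between the view feature and
    the question feature, and novelty is the distance from the candidate's
    camera to the nearest already-selected camera, normalised by the
    largest pairwise camera distance.
    """
    n = len(view_feats)
    relevance = [cosine(f, question_feat) for f in view_feats]
    # Largest pairwise camera distance; fall back to 1.0 to avoid dividing by zero.
    max_dist = max(
        (math.dist(p, q) for p in cam_positions for q in cam_positions),
        default=0.0,
    ) or 1.0

    selected = []
    while len(selected) < min(budget, n):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            if selected:
                novelty = min(
                    math.dist(cam_positions[i], cam_positions[j]) for j in selected
                ) / max_dist
            else:
                novelty = 1.0  # no view chosen yet, so every camera is equally novel
            score = alpha * relevance[i] + (1 - alpha) * novelty
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With `alpha=1.0` the sketch reduces to picking the views most similar to the question. Lowering `alpha` favours cameras that are far from those already chosen, which is one simple way to trade task relevance against spatial coverage when only a few views fit in the input budget.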
