Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

arXiv cs.CV | 2026-06-02 | Authors: Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

Abstract

arXiv:2606.00963v1

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, the relevant spatial cues are often sparse and scattered across redundant observations, which makes them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into an explicit spatial memory, such as a point cloud. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory.
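To make the idea of "structured program execution over spatial memory" concrete, here is a minimal sketch. It is not the Reasmory implementation: the abstract does not describe the program format, so the `MEMORY` dictionary, the `Step` type, the operation names (`locate`, `distance`, `less_than`), and the `$k` reference syntax are all hypothetical. The sketch only illustrates why a validated, fixed-vocabulary program is less brittle than free-form tool calls: unknown operations, wrong argument counts, and references to results that do not exist yet are rejected before anything runs.

```python
import math
from dataclasses import dataclass

# Hypothetical spatial memory: object label -> 3D centroid in one world frame.
# A real system would derive this from a reconstruction model; here it is hard-coded.
MEMORY = {
    "chair": (1.0, 0.0, 2.0),
    "lamp": (1.0, 0.0, 3.0),
    "table": (4.0, 0.0, 6.0),
}


@dataclass(frozen=True)
class Step:
    op: str      # name of an operation in OPS
    args: tuple  # literals, or "$k" references to the result of step k


def _locate(label):
    return MEMORY[label]


def _distance(p, q):
    return math.dist(p, q)


def _less_than(a, b):
    return a < b


# Fixed operation vocabulary (assumed for illustration): name -> (function, arity).
OPS = {
    "locate": (_locate, 1),
    "distance": (_distance, 2),
    "less_than": (_less_than, 2),
}


def run(program):
    """Validate and execute each step in order; return the last step's result."""
    results = []
    for i, step in enumerate(program):
        if step.op not in OPS:
            raise ValueError(f"step {i}: unknown op {step.op!r}")
        fn, arity = OPS[step.op]
        if len(step.args) != arity:
            raise ValueError(f"step {i}: {step.op} expects {arity} argument(s)")
        resolved = []
        for arg in step.args:
            if isinstance(arg, str) and arg.startswith("$"):
                k = int(arg[1:])
                if not 0 <= k < i:
                    raise ValueError(f"step {i}: reference {arg} is not an earlier step")
                resolved.append(results[k])
            else:
                resolved.append(arg)
        results.append(fn(*resolved))
    return results[-1]


# "Is the chair closer to the lamp than to the table?"
PROGRAM = [
    Step("locate", ("chair",)),
    Step("locate", ("lamp",)),
    Step("locate", ("table",)),
    Step("distance", ("$0", "$1")),
    Step("distance", ("$0", "$2")),
    Step("less_than", ("$3", "$4")),
]

print(run(PROGRAM))
```

With the hard-coded centroids above, the chair is 1.0 units from the lamp and 5.0 units from the table, so the program evaluates to `True`. In this sketch a VLM would only be asked to emit the list of steps; the geometry is computed by the executor from the memory rather than guessed by the model.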