DeepLatent: Think with Images via Parallel Latent Visual Reasoning

arXiv cs.CV · 2026-06-02 · Authors: Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

Abstract

arXiv:2606.00562v1 (announce type: new)

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm.
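
The abstract does not describe LatentFormer's internals, so the following is only a rough, dependency-free sketch of the general idea it names: a fixed set of learnable query tokens, conditioned on context, each reading directly from the original image features in a single pass. Everything here is an assumption for illustration: the function name `parallel_latents`, the additive context conditioning, the single-head dot-product attention without learned projections, and the toy dimensions. It is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def parallel_latents(queries, context, image_feats):
    """Produce one latent state per query token in a single pass.

    queries:     K query vectors (hypothetical stand-in for the learnable
                 2D tokens), each of dimension d
    context:     one d-dimensional context vector (assumed conditioning:
                 plain addition to every query)
    image_feats: N original image feature vectors, each of dimension d
    """
    d = len(context)
    scale = 1.0 / math.sqrt(d)
    latents, attn_maps = [], []
    # No iteration reads the output of another, so all K latents could be
    # computed in parallel (unlike autoregressive latent generation).
    for q in queries:
        conditioned = [qi + ci for qi, ci in zip(q, context)]
        weights = softmax([scale * dot(conditioned, f) for f in image_feats])
        # Each latent is a weighted average of the ORIGINAL image features,
        # never of previously generated latents.
        latent = [
            sum(w * f[j] for w, f in zip(weights, image_feats))
            for j in range(d)
        ]
        latents.append(latent)
        attn_maps.append(weights)
    return latents, attn_maps


def demo(seed=0, grid=2, n_patches=6, d=4):
    """Run the sketch on random toy data (all sizes are arbitrary)."""
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    queries = [vec() for _ in range(grid * grid)]  # grid x grid tokens, flattened
    context = vec()
    image_feats = [vec() for _ in range(n_patches)]
    return parallel_latents(queries, context, image_feats)
```

Calling `demo()` returns four latent vectors (a flattened 2 x 2 token grid) together with one attention distribution over the six toy patches per token. Because every token attends to the same unchanged `image_feats`, reordering the queries only reorders the outputs, which is the property that makes parallel generation possible in this sketch.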