LaRe: Latent Refocusing for Multimodal Reasoning 文章

ArXiv CS.CV2026-05-27NEWSen作者: Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

摘要

arXiv:2511.02360v4 Announce Type: replace Abstract: Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, yet incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through visual reconstruction objective.