LaRe: Latent Refocusing for Multimodal Reasoning


Details

Source site: ArXiv CS.CV
Authors: Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan
Article type: NEWS
Language: en
Publication date: 2026-05-27

Abstract

arXiv:2511.02360v4 (announce type: replace)

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, but incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise: that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that preserves the semantic structure of the latent space through a visual reconstruction objective.
