World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning 文章

ArXiv CS.CV2026-06-03NEWSen作者: Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

摘要

arXiv:2606.03603v1 Announce Type: new Abstract: World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD).