MUSE: A Unified Agentic Harness for MLLMs

arXiv cs.CV | 2026-06-03 | Authors: Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

Abstract

arXiv:2606.03005v1

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs.
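To make the idea of "structured parsing, deterministic verification, and verifier-guided repair" concrete, the following is a minimal sketch of such a loop on the grid-maze task the abstract mentions. It is not the authors' implementation: the function names (`parse_moves`, `verify`, `solve_with_repair`), the comma-separated `U/D/L/R` answer format, the text-grid encoding, and the `model(prompt, feedback)` callable are all assumptions chosen for illustration, and the text grid merely stands in for whatever task representation a real harness would extract from a screenshot.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical encoding: '#' wall, '.' free cell, 'S' start, 'G' goal.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def parse_moves(raw: str) -> List[str]:
    """Structured parsing: turn 'R, R, D' into ['R', 'R', 'D'] or raise."""
    tokens = [t for t in re.split(r"[,\s]+", raw.strip().upper()) if t]
    bad = [t for t in tokens if t not in MOVES]
    if bad:
        raise ValueError(f"unrecognised moves: {bad}; use only U, D, L, R")
    return tokens


def find(grid: List[str], ch: str) -> Tuple[int, int]:
    """Locate the first cell containing `ch`."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return r, c
    raise ValueError(f"grid has no {ch!r} cell")


def verify(grid: List[str], moves: List[str]) -> Optional[str]:
    """Deterministic verification: simulate the path.

    Returns None if the path reaches the goal, else an error message.
    """
    r, c = find(grid, "S")
    for i, m in enumerate(moves):
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
        if not inside or grid[r][c] == "#":
            return f"move {i} ({m}) hits a wall or leaves the grid"
    if grid[r][c] != "G":
        return f"path ends at ({r}, {c}), not at the goal"
    return None


def solve_with_repair(
    model: Callable[[str, str], str],  # assumed interface: (prompt, feedback) -> text
    grid: List[str],
    max_rounds: int = 3,
) -> Optional[List[str]]:
    """Verifier-guided repair: re-query the frozen model with the verifier's error."""
    prompt = "\n".join(grid)
    feedback = ""
    for _ in range(max_rounds):
        raw = model(prompt, feedback)
        try:
            moves = parse_moves(raw)
        except ValueError as exc:
            feedback = str(exc)  # parsing failures are also fed back
            continue
        error = verify(grid, moves)
        if error is None:
            return moves
        feedback = error
    return None  # no verified answer within the round budget
```

The point of the sketch is that the model itself is never modified: every improvement comes from checking its output with ordinary code and returning a specific, machine-generated error message on failure.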
