MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

arXiv cs.CV, 2026-05-26

Authors: Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

Abstract

arXiv:2505.23764v3

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments and evaluate 37 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's GPT-5 reasoning model reaches 40%, while humans score 97%.
