VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

ArXiv CS.CV | 2026-05-26 | Authors: Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne


Abstract

arXiv:2509.25339v3

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene.
