DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

ArXiv CS.CV | 2026-05-29 | Authors: Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

Abstract

arXiv:2605.29615v1 (announce type: new). Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element.
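
The grounding gate described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each render is available as a 2D grid of grayscale values and that the target element's bounding box is known. The names `changed_pixels`, `passes_grounding_gate`, and `target_box` are hypothetical, and rejecting pairs with no visible change at all is an added assumption rather than something the abstract states.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major; stand-in for a screenshot
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right and bottom exclusive


def changed_pixels(before: Image, after: Image) -> List[Tuple[int, int]]:
    """Return the (x, y) coordinates at which the two renders differ."""
    if len(before) != len(after) or any(
        len(b) != len(a) for b, a in zip(before, after)
    ):
        raise ValueError("renders must have identical dimensions")
    return [
        (x, y)
        for y, (row_b, row_a) in enumerate(zip(before, after))
        for x, (pb, pa) in enumerate(zip(row_b, row_a))
        if pb != pa
    ]


def passes_grounding_gate(before: Image, after: Image, target_box: Box) -> bool:
    """Keep a pair only if every changed pixel lies inside target_box."""
    left, top, right, bottom = target_box
    diff = changed_pixels(before, after)
    if not diff:
        # Assumption: a mutation with no visible effect is rejected.
        return False
    return all(left <= x < right and top <= y < bottom for x, y in diff)


# Toy 6x4 "page" with the target element occupying x in [1, 4), y in [1, 3).
before = [[0] * 6 for _ in range(4)]
box = (1, 1, 4, 3)

confined = [row[:] for row in before]
confined[1][2] = 255             # change inside the target element

leaky = [row[:] for row in confined]
leaky[3][5] = 255                # extra change outside it, e.g. a layout shift

print(passes_grounding_gate(before, confined, box))
print(passes_grounding_gate(before, leaky, box))
```

In this toy setup the first pair is retained and the second is discarded, since its mutation also alters a pixel outside the target element's box. A real pipeline would obtain the two renders and the bounding box from a browser rather than from hand-built grids.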
