DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

arXiv cs.CV | 2026-05-29 | Authors: Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

Abstract

arXiv:2605.29615v1

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element.
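The grounding gate described in the abstract can be illustrated with a small sketch. The abstract does not say how the gate is implemented, so everything here is an assumption made for illustration: the function names `changed_pixels` and `passes_grounding_gate`, the representation of renders as nested lists of RGB tuples, the exact (zero-tolerance) pixel comparison, the use of an axis-aligned bounding box for the target element, and the choice to reject pairs with no visible change.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = Sequence[Sequence[Pixel]]   # rows of RGB pixels
Box = Tuple[int, int, int, int]     # left, top, right, bottom (right/bottom exclusive)


def changed_pixels(before: Image, after: Image) -> List[Tuple[int, int]]:
    """Return the (x, y) coordinates at which the two renders differ."""
    same_shape = len(before) == len(after) and all(
        len(r0) == len(r1) for r0, r1 in zip(before, after)
    )
    if not same_shape:
        raise ValueError("renders must have identical dimensions")
    return [
        (x, y)
        for y, (row0, row1) in enumerate(zip(before, after))
        for x, (p0, p1) in enumerate(zip(row0, row1))
        if p0 != p1
    ]


def passes_grounding_gate(before: Image, after: Image, target_box: Box) -> bool:
    """Keep a pair only if every changed pixel lies inside the target box.

    Assumption: a pair with no visible change at all is also rejected.
    """
    left, top, right, bottom = target_box
    diff = changed_pixels(before, after)
    if not diff:
        return False
    return all(left <= x < right and top <= y < bottom for x, y in diff)


# Toy 4x4 "renders" standing in for page screenshots.
WHITE, RED = (255, 255, 255), (255, 0, 0)
before = [[WHITE] * 4 for _ in range(4)]

inside = [row[:] for row in before]
inside[1][1] = RED                  # change inside the target box

leaked = [row[:] for row in inside]
leaked[3][3] = RED                  # extra change outside the box

box = (1, 1, 3, 3)
print(passes_grounding_gate(before, inside, box))  # True
print(passes_grounding_gate(before, leaked, box))  # False
```

In a real pipeline the two renders would come from screenshots of the page before and after the CSS mutation, and the box from the target element's layout rectangle. A practical gate might also need a tolerance for anti-aliasing noise, which this sketch deliberately omits.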
