How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

ArXiv CS.CV | 2026-06-11 | NEWS | en | Authors: Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Min Yu, Tongxiao Ruan, Manni Duan

Details

Source site: ArXiv CS.CV
Authors: Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Min Yu, Tongxiao Ruan, Manni Duan
Article type: NEWS
Language: en
Publication date: 2026-06-11

Abstract

arXiv:2509.11548v2 (announce type: replace)

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task because they are not specifically optimized for it. This paper identifies a key gap: VLMs exhibit significant latent grounding potential, as shown by their performance on the Pointing Game, yet they underperform when asked to output explicit coordinates. To address this discrepancy and avoid the high data and annotation costs of current fine-tuning approaches, the authors propose three zero-shot auxiliary reasoning methods. By adding explicit spatial cues such as axes, grids, and labeled intersections to the input image, these methods help VLMs articulate their implicit spatial understanding. The methods are evaluated on four GUI grounding benchmarks across seven open-source and proprietary VLMs.
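To make the "labeled intersections" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the letter-plus-number label scheme, the function names `labeled_intersections` and `pointing_game_hit`, the grid size, and the example bounding box are all hypothetical. The sketch computes where labeled grid intersections would fall on a screenshot and maps a label answered by a model back to pixel coordinates. It then applies a Pointing Game style check, which counts a prediction as a hit when the point falls inside the target element's box. Drawing the grid onto the image is omitted and would need an imaging library.

```python
from string import ascii_uppercase


def labeled_intersections(width, height, cols, rows):
    """Map labels such as "C3" to the pixel coordinates of grid intersections.

    The grid splits a width x height image into cols x rows cells, so it has
    (cols + 1) vertical and (rows + 1) horizontal lines, borders included.
    The letter-plus-number label scheme is an assumption of this sketch.
    """
    if not 1 <= cols < len(ascii_uppercase):
        raise ValueError("cols must be between 1 and 25 for single-letter labels")
    if rows < 1:
        raise ValueError("rows must be at least 1")
    points = {}
    for i in range(cols + 1):          # vertical lines, left to right
        x = round(i * width / cols)
        for j in range(rows + 1):      # horizontal lines, top to bottom
            y = round(j * height / rows)
            points[f"{ascii_uppercase[i]}{j + 1}"] = (x, y)
    return points


def pointing_game_hit(point, bbox):
    """Return True if point (x, y) lies inside bbox (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


# Hypothetical example: a 1000x500 screenshot with a 10x5 grid.
points = labeled_intersections(1000, 500, cols=10, rows=5)
answer = "C3"                      # pretend the model answered with this label
target_box = (150, 180, 260, 240)  # made-up ground-truth element box
print(points[answer], pointing_game_hit(points[answer], target_box))
```

The design choice illustrated here is that the model only has to name a visible label, and ordinary code converts that label into coordinates, which sidesteps asking the model to emit raw pixel values directly.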
