Learning GUI Grounding with Spatial Reasoning from Visual Feedback (Event)

PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM

arXiv:2509.21552v2 Announce Type: replace

Abstract: Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task: given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To add