VISUALSKILL: Multimodal Skills for Computer-Use Agents

ArXiv CS.CL | 2026-06-18 | NEWS | en
Authors: Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

Details

Source site: ArXiv CS.CL
Authors: Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang
Article type: NEWS
Language: en
Publication date: 2026-06-18

Original text

Abstract

arXiv:2606.18448v1 (announce type: new)

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303).
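
To make the index-plus-topics layout described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use any MCP SDK. The `Skill` and `Topic` classes, the field names, the example application `ExampleEditor`, its topic names, and the figure path are all hypothetical. Only the name `load_topic` and the idea of fetching one topic's text and figures on demand come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One per-topic file: instructional text plus the figures it refers to."""
    text: str
    figures: list[str] = field(default_factory=list)  # hypothetical image paths


@dataclass
class Skill:
    """A skill for one application: a central index over per-topic files."""
    application: str
    index: dict[str, str]     # topic name -> one-line summary shown to the agent
    topics: dict[str, Topic]  # topic name -> full content, returned only on request

    def render_index(self) -> str:
        """Build the short index text the agent sees before loading any topic."""
        lines = [f"Skill index for {self.application}:"]
        lines += [f"- {name}: {summary}" for name, summary in self.index.items()]
        return "\n".join(lines)


def load_topic(skill: Skill, name: str) -> dict:
    """Return one topic's text and figures, or an error listing the valid names."""
    topic = skill.topics.get(name)
    if topic is None:
        return {"error": f"unknown topic {name!r}", "available": sorted(skill.index)}
    return {"topic": name, "text": topic.text, "figures": list(topic.figures)}


# Hypothetical skill content, for illustration only.
skill = Skill(
    application="ExampleEditor",
    index={"export": "Exporting a document", "layers": "Working with layers"},
    topics={
        "export": Topic("Open File > Export, then pick a format.",
                        ["figures/export_dialog.png"]),
        "layers": Topic("Use the Layers panel to reorder layers."),
    },
)

print(skill.render_index())
print(load_topic(skill, "export"))
```

The point of the split is that the agent's context initially holds only the short index, and the longer text and images for a topic are returned when the agent asks for that topic by name. In a real deployment the function would presumably be exposed as an MCP tool and the figures returned as image content rather than paths, but those details are not given in the abstract.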
