Agent Skills Should Go Beyond Text: The Case for Visual Skills

arXiv cs.CV | 2026-06-02 | Authors: Binxiao Xu, Ruichuan An, Bocheng Zou, Hang Hua

Abstract

arXiv:2606.01414v1

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf{\NAME}, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them.
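To make the three reusable forms named in the abstract concrete, here is a minimal Python sketch of how they might be represented as data. Everything in it is an assumption for illustration: the class names (`VisualRef`, `StaticPrior`, `DynamicPrior`, `InterleavedStep`, `InterleavedVisualSkill`), the fields, the `render` helper, and the example file names are hypothetical and are not taken from the paper, which may structure its skills quite differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VisualRef:
    """Hypothetical pointer to visual evidence: a frame, screenshot, or page region."""
    source: str                                         # e.g. a file path or frame id
    region: Optional[Tuple[int, int, int, int]] = None  # (x, y, width, height); None = whole image


@dataclass
class StaticPrior:
    """Assumed form: a stable spatial convention reused across episodes."""
    description: str
    evidence: List[VisualRef] = field(default_factory=list)


@dataclass
class DynamicPrior:
    """Assumed form: in-situ visual working memory captured during the current task."""
    description: str
    snapshot: VisualRef


@dataclass
class InterleavedStep:
    """One ordered text step bound to the visual evidence that justifies it."""
    text: str
    evidence: List[VisualRef]


@dataclass
class InterleavedVisualSkill:
    """Assumed form: ordered text steps interleaved with their supporting visuals."""
    name: str
    steps: List[InterleavedStep]

    def render(self) -> List[str]:
        """Flatten the skill into numbered text lines followed by image placeholders."""
        lines: List[str] = []
        for index, step in enumerate(self.steps, start=1):
            lines.append(f"{index}. {step.text}")
            for ref in step.evidence:
                lines.append(f"   [image: {ref.source}]")
        return lines


# Illustrative usage with made-up screenshots.
skill = InterleavedVisualSkill(
    name="open_display_settings",
    steps=[
        InterleavedStep(
            "Click the gear icon in the top-right corner.",
            [VisualRef("screenshot_001.png", (900, 10, 32, 32))],
        ),
        InterleavedStep("Select the Display tab.", [VisualRef("screenshot_002.png")]),
    ],
)
rendered = skill.render()
```

The point of the sketch is only the binding: each text step carries references to the images that justify it, so a consumer can present text and visual support together instead of text alone.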
