Action with Visual Primitives 文章

ArXiv CS.AI2026-05-26NEWSen作者: Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yuyang Pang, Wenda Xu, Gao Huang

查看原文 →

关系图谱

详细信息

来源站点: ArXiv CS.AI
作者: Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yuyang Pang, Wenda Xu, Gao Huang
文章类型: NEWS
语言: en
发布日期: 2026-05-26

原文

摘要

arXiv:2605.22183v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements this visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics.

Action with Visual Primitives 文章

详细信息

摘要

相关事件

相关公司

相关人物

相关产品

相关技术查看全部 (2)