QuoVLA: Quotient Space for Vision-Language-Action Models

arXiv cs.CV | 2026-05-26 | Authors: Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han

Abstract

arXiv:2605.24890v1 (announce type: new)

Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view: they assume that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are action-sufficient rather than action-insufficient. They already contain the information needed for control, yet they remain overcomplete because they distinguish prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations.
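
One way to read the quotient idea in the abstract is sketched below. The notation is illustrative and is not taken from the paper: the encoder $f$, the latent space $\mathcal{Z}$, the optimal policy $\pi^{*}$, the quotient map $q$, and the induced policy $\bar{\pi}$ are all assumed symbols, and the sketch presumes the optimal action distribution can be written as a function of the latent alone.

```latex
% Illustrative notation only (assumed, not from the paper).
% z = f(o, \ell): VLM latent for observation o and instruction \ell.
% \pi^{*}(\cdot \mid z): optimal action distribution given the latent.

% Two latents are equivalent when they induce the same optimal action behavior:
z \sim z' \iff \pi^{*}(\cdot \mid z) = \pi^{*}(\cdot \mid z')

% The quotient map sends each latent to its equivalence class:
q : \mathcal{Z} \to \mathcal{Z}/{\sim}, \qquad q(z) = [z]

% Action sufficiency: the optimal policy factors through the quotient,
% for a unique induced policy \bar{\pi} on the classes:
\pi^{*}(\cdot \mid z) = \bar{\pi}\bigl(\cdot \mid q(z)\bigr)

% Overcompleteness: q is not injective, i.e. some distinct latents
% (for example, from paraphrased prompts) fall in the same class:
\exists\, z \neq z' \ \text{such that}\ q(z) = q(z')
```

Under this reading, compressing latents into action-sufficient representations amounts to learning something close to $q$: the map discards distinctions between latents that never change the optimal action while keeping everything the policy needs.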
