Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning 文章

ArXiv CS.CV2026-05-29NEWSen作者: Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim

摘要

arXiv:2605.29577v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space.

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (2)