Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

ArXiv CS.CV, 2026-05-29. Authors: Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim

Abstract

arXiv:2605.29577v1 (announce type: new)

Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space.
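To make the notion of state aliasing concrete, the sketch below flags pairs of states whose visual features are nearly identical while their target actions differ substantially. It is an illustration only, not the paper's analysis or metric: the function names, the cosine-similarity and Euclidean-distance choices, the two thresholds, and the toy data are all assumptions introduced here.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def action_distance(a, b):
    """Euclidean distance between two action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_aliased_pairs(features, actions, sim_threshold=0.99, action_threshold=0.5):
    """Return index pairs (i, j) whose features look alike but whose actions differ.

    Both thresholds are hypothetical values chosen for this toy example.
    """
    aliased = []
    for i, j in combinations(range(len(features)), 2):
        similar = cosine_similarity(features[i], features[j]) >= sim_threshold
        different = action_distance(actions[i], actions[j]) >= action_threshold
        if similar and different:
            aliased.append((i, j))
    return aliased


# Toy data (made up): states 0 and 1 have almost the same features
# but require very different actions; state 2 is clearly distinct.
features = [
    [1.0, 0.0, 0.01],
    [1.0, 0.0, 0.02],
    [0.0, 1.0, 0.0],
]
actions = [
    [0.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0],
]

pairs = find_aliased_pairs(features, actions)
print(pairs)
```

In this toy setup only the first two states are flagged, since they are the only pair that is both close in feature space and far apart in action space. A real analysis would use features taken from the model's vision encoder and actions from demonstration data.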
