3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

ArXiv CS.CV | 2026-05-29 | Authors: Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

Abstract

arXiv:2605.29416v1 (announce type: cross)

Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address these challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding vision-language model (VLM) priors.
