AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models (Event)

PRODUCT_LAUNCH | 2026-05-26 | Impact: MEDIUM

arXiv:2605.25901v1 | Announce Type: new

Abstract: 3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided b