MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

arXiv cs.CV | 2026-06-02 | Authors: Hilton Raj, Vishnuram AV

Abstract

arXiv:2606.02463v1 (announce type: new)

In 3D environments, embodied agents answer spatial questions by reasoning over a mixture of modalities, including natural language, RGB images, point clouds, depth maps, and camera poses. Existing vision-language models (VLMs) are fine-tuned on a single modality, which ignores the question semantics: a given question may favor a different modality than the one the model was fine-tuned on. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five modality adapters on a shared VLM backbone and learns a neural routing policy that selects the best adapter for each question at inference time. We encode each question with a frozen sentence transformer and pass the embedding through a small multi-layer perceptron (MLP) trained on oracle adapter-accuracy labels.
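The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the adapter names, layer sizes, ReLU activation, and randomly initialized weights are all assumptions made for illustration, and the question embedding is a stand-in vector where the paper would use the output of a frozen sentence transformer. The sketch covers only the inference-time forward pass and adapter selection, not training on oracle adapter-accuracy labels.

```python
import math
import random

# Hypothetical adapter names, one per modality listed in the abstract.
ADAPTERS = ["text", "rgb", "point_cloud", "depth", "camera_pose"]


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Create random weights for a two-layer MLP (untrained, for illustration)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, in_dim),
        "b1": [0.0] * hidden_dim,
        "W2": matrix(out_dim, hidden_dim),
        "b2": [0.0] * out_dim,
    }


def linear(weights, bias, x):
    """Compute weights @ x + bias for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(params, embedding):
    """Return the selected adapter name and the per-adapter probabilities."""
    hidden = [max(0.0, h) for h in linear(params["W1"], params["b1"], embedding)]
    probs = softmax(linear(params["W2"], params["b2"], hidden))
    best = max(range(len(probs)), key=probs.__getitem__)
    return ADAPTERS[best], probs


# Stand-in for a question embedding; a real system would obtain this
# from the frozen sentence transformer instead.
embedding = [0.3, -0.1, 0.8, 0.0, 0.5, -0.7, 0.2, 0.4]
params = init_mlp(in_dim=len(embedding), hidden_dim=16, out_dim=len(ADAPTERS))
choice, probs = route(params, embedding)
print(choice, [round(p, 3) for p in probs])
```

Because the weights here are untrained, the selected adapter is arbitrary; in the framework described above, the MLP would first be fit so that its output favors whichever adapter answers a given question most accurately.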
