Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

ArXiv CS.CV, 2026-06-02. Authors: Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

Abstract

arXiv:2606.00275v1 (announce type: new)

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat the visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. Ignoring this asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. A Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to dependence on parametric memory, losing their grounding in the provided visual and linguistic information.