VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

ArXiv CS.CV | 2026-05-26 | Authors: Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang

Abstract

arXiv:2605.25952v1

Recent efficient methods have made remarkable progress in accelerating multimodal understanding, but they still suffer from noticeable performance degradation. They emphasize a high compression ratio for a single visual clue and rely on heuristic pruning with coarse attention alignment, which creates a bottleneck in both the information capacity and the information density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows an enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact them with adaptive routers in specialized visual experts to enhance the information density.
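The enrich-then-compact idea can be illustrated with a toy sketch. The abstract does not specify how the visual representations are unified or how the routers score tokens, so everything below is an assumption made for illustration: tokens from several hypothetical encoders are simply concatenated, each stage uses a hypothetical linear router, and a fixed keep ratio stands in for the adaptive routing. It is not the authors' implementation.

```python
import math

def enrich(views):
    """Pool tokens from several visual encoders (assumed: plain concatenation)."""
    return [tok for view in views for tok in view]

def router_scores(tokens, weights):
    """Hypothetical linear router: one scalar score per token (dot product)."""
    return [sum(t * w for t, w in zip(tok, weights)) for tok in tokens]

def compact(tokens, weights, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    scores = router_scores(tokens, weights)
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

def enrich_then_compact(views, stage_weights, keep_ratio=0.5):
    """Enrich once, then compact progressively, one router per stage."""
    tokens = enrich(views)
    for weights in stage_weights:
        tokens = compact(tokens, weights, keep_ratio)
    return tokens

# Two hypothetical encoder views, each with two 2-dimensional tokens.
views = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]]]
kept = enrich_then_compact(views, stage_weights=[[1.0, 0.0], [0.0, 1.0]])
print(len(enrich(views)), "tokens enriched,", len(kept), "kept")
```

In this sketch the token pool first grows (capacity) and then shrinks stage by stage (density); a learned, input-dependent keep ratio would replace the fixed `keep_ratio` in any realistic variant.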
