Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs 文章

ArXiv CS.CV2026-06-03NEWSen作者: Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

摘要

arXiv:2606.03879v1 Announce Type: new Abstract: As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall.

相关公司

暂无数据

相关人物

暂无数据

相关技术

暂无数据