On the Limits of Token Reduction for Efficient Unified Vision Language Training

arXiv cs.CV | 2026-06-02 | Authors: Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

Abstract

arXiv:2606.01503v1

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective.
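
The abstract does not say how layerwise attention allocation is measured, so the following is only a minimal sketch of one plausible diagnostic: the share of attention mass that each layer places on image-token keys. The function names, the array layout (layers, heads, queries, keys), and the 0.05 threshold are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def image_attention_share(attn, image_mask):
    """Per-layer fraction of attention mass that lands on image-token keys.

    attn: array of shape (layers, heads, queries, keys); each row over
          `keys` is assumed to be a softmax distribution (sums to 1).
    image_mask: boolean array of shape (keys,), True for image tokens.
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    # Sum attention over image-token keys, then average over heads and queries.
    mass_on_images = attn[..., image_mask].sum(axis=-1)
    return mass_on_images.mean(axis=(1, 2))

def first_redundant_layer(share, threshold=0.05):
    """Hypothetical rule: first layer from which the image share stays
    below `threshold` for all remaining layers; None if there is none."""
    below = np.asarray(share) < threshold
    for layer in range(len(below)):
        if below[layer:].all():
            return layer
    return None
```

Under this reading, an understanding task with late-layer visual redundancy would yield a share that falls and stays low from some layer onward, making image tokens candidates for reduction there, while a generation task whose share stays high across depth would return None and offer no such cut point.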
