On the Limits of Token Reduction for Efficient Unified Vision Language Training

arXiv cs.CV | 2026-06-02 | Authors: Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

Abstract

arXiv:2606.01503v1 (new submission)

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective.
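To make the idea of "layerwise attention allocation" concrete, here is a minimal sketch of one way such a measurement could be set up. It is not the authors' implementation: the function names, the toy attention values, and the 0.2 threshold are all hypothetical, and the abstract does not specify how the analysis was actually computed. The sketch assumes each layer's attention is available as row-normalized weights (one row per query token) along with a flag marking which key positions are image tokens.

```python
from typing import List, Sequence

def image_attention_share(
    attn: Sequence[Sequence[Sequence[float]]],
    is_image: Sequence[bool],
) -> List[float]:
    """Per-layer mean fraction of attention mass that lands on image tokens.

    attn[layer][query][key] is assumed to be row-normalized (each row sums to 1).
    is_image[key] marks which key positions are image tokens.
    """
    shares = []
    for layer in attn:
        # For every query row, sum the weights assigned to image-token keys.
        per_query = [
            sum(w for w, img in zip(row, is_image) if img) for row in layer
        ]
        shares.append(sum(per_query) / len(per_query))
    return shares

def low_image_share_layers(shares: Sequence[float], threshold: float) -> List[int]:
    """Indices of layers whose image-attention share is below a chosen threshold.

    The threshold is an illustrative knob, not a value taken from the paper.
    """
    return [i for i, s in enumerate(shares) if s < threshold]

# Toy example: 2 layers, 2 query tokens, 4 key tokens (first two are image tokens).
toy_is_image = [True, True, False, False]
toy_attn = [
    [[0.4, 0.4, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2]],     # early layer: image-heavy
    [[0.05, 0.05, 0.5, 0.4], [0.1, 0.0, 0.5, 0.4]],   # late layer: text-heavy
]
toy_shares = image_attention_share(toy_attn, toy_is_image)
toy_candidates = low_image_share_layers(toy_shares, threshold=0.2)
```

In this toy setup the late layer puts little attention on image tokens, which is the kind of pattern the abstract describes for visual understanding. For visual generation, the abstract reports that dependence on image tokens persists across depth, so a profile like this would be expected to stay high in every layer and offer fewer candidates for reduction.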
