ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

arXiv cs.CV, 2026-06-02. Authors: Yiling Gao, Hongchen Wei, Zhenzhong Chen

Abstract

arXiv:2606.00543v1

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational cost and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that, based on the principle of variational information distillation, minimizes task loss while reducing the number of input tokens. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features, approximating the latent instruction-aware predictive statistic.
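The attention-weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how many compressed tokens are produced, which layers or heads supply the attention, or whether learned projections are involved. The sketch assumes a single head, no learned projections, and one pooled vector per text query; the name `compress_visual_tokens` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress_visual_tokens(
    text_queries: List[Vector], visual_tokens: List[Vector]
) -> List[Vector]:
    """Pool N visual tokens into one vector per text query.

    Assumption (not from the paper): plain scaled dot-product attention,
    with text embeddings as queries and visual tokens as keys and values.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in text_queries:
        # Text-to-image attention weights over all visual tokens.
        weights = softmax([dot(query, v) * scale for v in visual_tokens])
        # Attention-weighted average of the original visual features.
        pooled = [
            sum(w * v[d] for w, v in zip(weights, visual_tokens))
            for d in range(dim)
        ]
        compressed.append(pooled)
    return compressed


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy visual tokens
    queries = [[50.0, 0.0]]  # one toy instruction embedding
    print(compress_visual_tokens(queries, visual))
```

In this toy setup, three visual tokens collapse into a single vector dominated by the tokens most aligned with the query, so the language model would see one token in place of three. Which visual content survives depends on the instruction, which is the sense in which the compression is task-aware.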

