Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

arXiv cs.CL | 2026-05-26 | Authors: Jiale Luo, Xiaoyu Liang, Haoji Hu

Abstract

arXiv:2605.25179v1 (announce type: new)

Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling or score-based pruning. Fixed pooling is content-agnostic, while score-based pruning can preserve isolated salient tokens but discard nearby acoustic context. We propose Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint.
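The abstract does not spell out the matching rule, so the following is only a minimal sketch of what window-constrained bipartite merging could look like, not the authors' implementation. It assumes an alternating even/odd partition of the token sequence, cosine similarity as the matching score, and plain averaging as the merge operation. The names `local_bipartite_merge`, `r` (number of tokens to merge away), and `window` (maximum temporal distance between merged tokens) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def local_bipartite_merge(tokens, r, window):
    """Merge up to r tokens into similar tokens at most `window` positions away.

    Assumption: even positions are merge sources, odd positions are
    destinations. This partition is illustrative, not taken from the paper.
    """
    n = len(tokens)
    src = range(0, n, 2)  # tokens that may be merged away
    dst = range(1, n, 2)  # tokens that absorb merges

    # For each source token, find its most similar destination inside the window.
    candidates = []
    for i in src:
        best = None
        for j in dst:
            if abs(i - j) <= window:
                s = cosine(tokens[i], tokens[j])
                if best is None or s > best[0]:
                    best = (s, j)
        if best is not None:
            candidates.append((best[0], i, best[1]))

    # Keep only the r most similar (source, destination) pairs.
    candidates.sort(key=lambda c: c[0], reverse=True)
    chosen = candidates[:r]

    # Accumulate merged sources into their destinations, then average.
    sums = {j: list(tokens[j]) for j in dst}
    counts = {j: 1 for j in dst}
    merged_away = set()
    for _, i, j in chosen:
        sums[j] = [a + b for a, b in zip(sums[j], tokens[i])]
        counts[j] += 1
        merged_away.add(i)

    # Rebuild the sequence in its original temporal order.
    out = []
    for k in range(n):
        if k in merged_away:
            continue
        if k in sums:
            out.append([x / counts[k] for x in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out


# Toy usage: six 2-d "audio tokens", merge two of them with a window of 1.
toy = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [1, 0]]
compressed = local_bipartite_merge(toy, r=2, window=1)
print(len(toy), "->", len(compressed))
```

In this sketch the window is the only thing that separates the procedure from global bipartite merging: a source token can never be absorbed by a similar token far away in time, so each merged token summarizes a contiguous acoustic neighbourhood.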
