MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies 文章

ArXiv CS.CL2026-05-27NEWSen作者: Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma

摘要

arXiv:2605.26444v1 Announce Type: new Abstract: Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters.

MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (3)

相关人物

相关产品查看全部 (7)

相关技术查看全部 (19)