HiSpec: Hierarchical Speculative Decoding for LLMs

ArXiv CS.CL | 2026-05-27 | Authors: Avinash Kumar, Sujay Sanghavi, Poulami Das

Abstract

arXiv:2510.01336v2 (announce type: replace)

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior work focuses only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overhead to incorporate the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics.
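The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. It assumes greedy decoding and toy "models" that are plain functions from a token context to the next token. It is not HiSpec's algorithm: the names `accepted_prefix`, `speculative_step`, `toy_target`, `toy_draft`, and the `intermediate` parameter are all hypothetical and chosen for illustration only. Real systems verify all drafted tokens in one batched forward pass, whereas the sequential loop here is kept for clarity.

```python
from typing import Callable, List, Optional

# Assumption: a "model" is a hypothetical function mapping a token context
# to its greedy next token. Real LLMs return logits; this is a toy stand-in.
Model = Callable[[List[int]], int]


def accepted_prefix(verifier: Model, context: List[int], draft: List[int]) -> List[int]:
    """Return the longest prefix of `draft` that `verifier` would also emit greedily."""
    kept: List[int] = []
    for tok in draft:
        if verifier(context + kept) != tok:
            break  # first mismatch: discard this token and everything after it
        kept.append(tok)
    return kept


def speculative_step(
    draft_model: Model,
    target_model: Model,
    context: List[int],
    k: int,
    intermediate: Optional[Model] = None,
) -> List[int]:
    """One round of greedy speculative decoding; returns the newly committed tokens."""
    # 1) The small draft model speculates k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_model(context + draft))

    # 2) Optional intermediate verification (illustrative): a cheaper verifier
    #    trims draft tokens it rejects before the target model sees them.
    if intermediate is not None:
        draft = accepted_prefix(intermediate, context, draft)

    # 3) The target model verifies whatever survived.
    kept = accepted_prefix(target_model, context, draft)

    # 4) The target supplies one more token: a correction after a mismatch,
    #    or a bonus token if the whole draft was accepted.
    kept.append(target_model(context + kept))
    return kept


# Toy models (hypothetical): the target counts upward modulo 10; the draft
# agrees with it except right after token 3, where it guesses wrongly.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


new_tokens = speculative_step(toy_draft, toy_target, [0], k=4)
print(new_tokens)
```

Because every committed token is either confirmed or produced by the target model, the output in this sketch matches what greedy decoding with the target alone would produce. The intermediate verifier can only shorten the draft that reaches the target; it never changes which tokens are committed.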
