KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

ArXiv CS.AI, 2026-06-03. Authors: Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

Abstract

arXiv:2602.20217v2. Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate.
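
To make the knapsack framing concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not give the objective, the latency model, or the parallel dynamic program, so everything below is an assumption for illustration. The function names `block_latencies` and `select_blocks`, the linear attention-latency model, and the per-block `value` scores are all hypothetical. The sketch treats each Attention and MLP block as a separate item, gives it a latency "weight" that depends on context length, and uses a textbook sequential 0/1 knapsack to pick the blocks to keep in the draft model under a latency budget.

```python
from typing import List, Tuple


def block_latencies(num_layers: int, context_len: int,
                    attn_us_per_token: float = 0.01,
                    attn_base_us: float = 20.0,
                    mlp_us: float = 40.0) -> List[int]:
    """Hypothetical latency model, in integer microseconds.

    Assumption: attention cost grows linearly with context length while
    MLP cost is constant. Real numbers would come from hardware profiling.
    Blocks are interleaved as [attn_0, mlp_0, attn_1, mlp_1, ...].
    """
    attn = round(attn_base_us + attn_us_per_token * context_len)
    mlp = round(mlp_us)
    return [attn, mlp] * num_layers


def select_blocks(latency: List[int], value: List[float],
                  budget: int) -> Tuple[float, List[int]]:
    """Textbook 0/1 knapsack: choose which blocks to keep in the draft.

    latency[i] is the integer cost of block i, value[i] is a hypothetical
    usefulness score for keeping it, and budget is the total latency
    allowed. Returns (best total value, sorted indices of kept blocks).
    """
    n = len(latency)
    # best[i][b] = best total value using the first i blocks within budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = latency[i - 1], value[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip block i-1
            if w <= b and best[i - 1][b - w] + v > best[i][b]:
                best[i][b] = best[i - 1][b - w] + v  # option 2: keep it

    # Walk back through the table to recover which blocks were kept.
    kept: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= latency[i - 1]
    kept.reverse()
    return best[n][budget], kept
```

Under this assumed latency model, attention blocks become more expensive as the context grows, so the same budget buys fewer of them at long context lengths. That is the kind of context-dependent trade-off the abstract says static layer-skipping heuristics ignore.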
