WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

ArXiv CS.AI | 2026-06-09 | Authors: Young D. Kwon, Miles Williams, Rui Li, Alexandros Kouris, Stylianos I. Venieris

Abstract

arXiv:2606.07710v1 (announce type: cross)

The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic workloads. While speculative decoding (SD) accelerates inference, current approaches rely on static drafting paradigms, utilising either autoregressive drafting models for reasoning or diffusion-based parallel drafting models for structured outputs. We empirically find that drafting accuracy fluctuates dramatically within a single sequence, leaving significant performance unrealised by static paradigms and coarse-grained routing. To address this volatility, we introduce WhiFlash, the first cross-paradigm SD method that unifies autoregressive and diffusion-based parallel drafting under a single token-level controller.
