DFlash: Block Diffusion for Flash Speculative Decoding

ArXiv CS.CL · 2026-05-29 · Authors: Jian Chen, Yesheng Liang, Zhijian Liu

Abstract

arXiv:2602.06036v2

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting.
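The draft-then-verify loop described in the abstract can be illustrated with a minimal Python sketch. This is not the DFlash implementation: `toy_target` and `toy_block_drafter` are hypothetical stand-ins for the target LLM and a block drafter, the drafter is a simple deterministic function rather than a diffusion model, and verification uses plain greedy token matching. The sketch only shows why accepting a whole drafted block per verification round reduces the number of sequential target steps while reproducing the target's greedy output.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]                 # greedy next-token function
BlockDrafter = Callable[[List[Token], int], List[Token]]   # proposes k tokens per call


def toy_target(context: List[Token]) -> Token:
    """Hypothetical stand-in for the target LLM's greedy next token."""
    return (3 * context[-1] + len(context)) % 11


def toy_block_drafter(context: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter that returns k tokens per call.

    It imitates the target but is deliberately wrong whenever the context
    length is a multiple of 4, so that some drafted tokens get rejected.
    """
    block: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = toy_target(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % 11  # inject a drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def greedy_decode(prompt: List[Token], target: NextToken, max_new_tokens: int) -> List[Token]:
    """Baseline: one target step per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[Token],
    target: NextToken,
    drafter: BlockDrafter,
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = drafter(out, block_size)
        accepted: List[Token] = []
        # A real system would score every drafted position in one parallel
        # forward pass of the target; this loop emulates that check.
        for tok in draft:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # Every drafted token matched: the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rounds = speculative_decode(prompt, toy_target, toy_block_drafter, 4, 12)
    print(tokens == greedy_decode(prompt, toy_target, 12), rounds)
```

In this toy setup the speculative output matches plain greedy decoding token for token, while the number of verification rounds is smaller than the number of generated tokens. How many tokens a real drafter gets accepted per round depends on how closely it tracks the target, which this sketch does not model.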
