Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

arXiv cs.AI | 2026-06-01 | Authors: Deokjin Seo, Gangin Park, Kihyun Nam

Abstract

arXiv:2605.30748v1 (announce type: cross)

We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder. The resulting decoder generates the tokens within each block in parallel while still streaming block by block. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, because the long-tailed token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence.
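The abstract names the two inference-time techniques but gives no formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the block-level marginal is the average of the per-position predictive distributions within the block, that the subtraction happens in the log domain with a hypothetical weight `alpha`, and that the early-decoding schedule commits every masked position whose calibrated confidence reaches a hypothetical `threshold`, stopping as soon as no masked position remains. All function and parameter names are illustrative.

```python
import math
from typing import Dict, List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def calibrated_scores(block_logits: List[List[float]],
                      alpha: float = 1.0,
                      eps: float = 1e-9) -> List[List[float]]:
    """Score each (position, token) as log p(token) - alpha * log prior(token).

    Assumption: the prior is the mean of the per-position distributions
    over the block; the abstract does not define it more precisely.
    """
    log_probs = [log_softmax(row) for row in block_logits]
    n, vocab = len(log_probs), len(log_probs[0])
    prior = [sum(math.exp(lp[v]) for lp in log_probs) / n for v in range(vocab)]
    log_prior = [math.log(p + eps) for p in prior]
    return [[lp - alpha * lq for lp, lq in zip(row, log_prior)]
            for row in log_probs]


def best_token(scores_row: List[float]) -> Tuple[int, float]:
    """Return the highest-scoring token id and its score."""
    tok = max(range(len(scores_row)), key=scores_row.__getitem__)
    return tok, scores_row[tok]


def decode_step(block_logits: List[List[float]],
                masked: List[int],
                threshold: float,
                alpha: float = 1.0) -> Tuple[Dict[int, int], bool]:
    """One parallel decoding iteration over the still-masked positions.

    Commits every masked position whose calibrated confidence is at least
    `threshold`; if none qualifies, commits the single most confident one
    so that the loop always makes progress. Returns the committed
    {position: token} map and a flag that is True when the block is done.
    """
    scores = calibrated_scores(block_logits, alpha)
    picks = {pos: best_token(scores[pos]) for pos in masked}
    commit = {pos: tok for pos, (tok, s) in picks.items() if s >= threshold}
    if not commit:
        pos = max(picks, key=lambda p: picks[p][1])
        commit = {pos: picks[pos][0]}
    return commit, len(commit) == len(masked)


# Toy block: token 0 plays the role of a high-frequency token that wins
# the raw argmax at every position.
toy_logits = [
    [2.0, 0.0, 0.0],
    [2.0, 1.9, 0.0],
    [2.0, 0.0, 1.9],
]
raw_choice = [max(range(3), key=row.__getitem__) for row in toy_logits]
cal_choice = [best_token(row)[0] for row in calibrated_scores(toy_logits)]
```

In this toy block the raw argmax selects token 0 everywhere, whereas the calibrated scores favor the position-specific tokens at positions 1 and 2, which is the kind of frequency bias the abstract describes. In a real decoder the logits would be recomputed by the model after each commit; the sketch scores a fixed set of logits only to keep the example self-contained.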