StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

ArXiv CS.CV | 2026-05-26 | Authors: Linrui Tian, Qi Wang, Bang Zhang

Abstract

arXiv:2605.25659v1

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning.
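The decoupling described in the abstract can be pictured as a control-flow sketch. This is not the authors' implementation: the class and function names, the frame rate, the chunk size, the number of motion frames, and the one-token-per-frame conditioning are all hypothetical stand-ins. It only illustrates the split between an orchestrator that holds long-horizon state and a denoiser that sees a short window.

```python
from dataclasses import dataclass, field
from typing import List

# All constants are illustrative assumptions, not values from the paper.
FPS = 25            # assumed playback frame rate
CHUNK_FRAMES = 8    # assumed frames generated per chunk
MOTION_FRAMES = 2   # assumed trailing frames carried into the next chunk


@dataclass
class Orchestrator:
    """Hypothetical stand-in for the LLM-based orchestrator.

    It owns the long-horizon state: the transcript position and the history
    of conditions emitted so far.
    """
    transcript: List[str]
    cursor: int = 0
    history: List[str] = field(default_factory=list)

    def next_conditions(self, n_frames: int) -> List[str]:
        # Assumption: one transcript token per frame, padded with silence.
        tokens = self.transcript[self.cursor:self.cursor + n_frames]
        self.cursor += len(tokens)
        tokens = tokens + ["<sil>"] * (n_frames - len(tokens))
        self.history.extend(tokens)
        return tokens


def denoise_chunk(conditions: List[str], reference: str,
                  motion_frames: List[str]) -> List[str]:
    """Hypothetical stand-in for the joint audio-video DiT.

    It receives only the short window: per-frame conditions, the reference,
    and a few motion frames. A real model would denoise here; this stub just
    tags each frame with its condition.
    """
    return [f"{reference}|{c}" for c in conditions]


def stream(transcript: List[str], reference: str, n_chunks: int) -> List[str]:
    orch = Orchestrator(list(transcript))
    motion = [reference] * MOTION_FRAMES  # bootstrap from the reference
    out: List[str] = []
    for _ in range(n_chunks):
        cond = orch.next_conditions(CHUNK_FRAMES)      # long-horizon side
        frames = denoise_chunk(cond, reference, motion)  # short-window side
        motion = frames[-MOTION_FRAMES:]               # carry motion frames
        out.extend(frames)
    return out


def playback_budget_seconds(chunk_frames: int = CHUNK_FRAMES,
                            fps: int = FPS) -> float:
    """Time available to generate one chunk before playback would stall."""
    return chunk_frames / fps
```

Under these assumed numbers, each 8-frame chunk at 25 fps must be produced in at most 0.32 s of wall-clock time to keep up with playback, which is the kind of constraint that motivates few-step distillation.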