Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

arXiv cs.CV, 2026-05-26
Authors: Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han, Weijie Kong, Jiangfeng Xiong, Jian-Wei Zhang, Zhao Zhong, Liefeng Bo, Zuxuan Wu, Yu-Gang Jiang

Abstract

arXiv:2605.25195v1 (announce type: new)

Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment.

We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories.
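The abstract does not describe Baton's architecture, so the following is only a toy illustration of the general idea it states: a plan is computed once, before denoising, and both the video and the audio branch are conditioned on the same shared part of that plan in addition to the coarse text embedding. Every name here (`coarse_text_embedding`, `plan_tokens`, `build_condition`, `denoise`, `generate`) is hypothetical, the "encoder" and "planner" are seeded random stand-ins, and the "denoiser" is a simple pull toward a conditioning-dependent target rather than a diffusion model.

```python
import random


def coarse_text_embedding(prompt, dim=4):
    # Stand-in for an off-the-shelf text encoder (assumption, not a real model):
    # a deterministic pseudo-embedding seeded by the prompt string.
    rng = random.Random(prompt)
    return [rng.uniform(-1, 1) for _ in range(dim)]


def plan_tokens(prompt, dim=4):
    # Hypothetical planner: produces one shared vector plus one vector per
    # modality. It is called once, before any denoising step runs.
    rng = random.Random("plan:" + prompt)
    return {
        "shared": [rng.uniform(-1, 1) for _ in range(dim)],
        "video": [rng.uniform(-1, 1) for _ in range(dim)],
        "audio": [rng.uniform(-1, 1) for _ in range(dim)],
    }


def build_condition(text_emb, plan, modality):
    # The coarse text guidance is complemented, not replaced: the condition is
    # text embedding + shared plan + modality-specific plan, concatenated.
    return text_emb + plan["shared"] + plan[modality]


def denoise(latent, condition, steps=10, rate=0.3):
    # Toy "denoiser": move every latent value toward the mean of the condition.
    target = sum(condition) / len(condition)
    for _ in range(steps):
        latent = [x + rate * (target - x) for x in latent]
    return latent


def generate(prompt, seed=0):
    rng = random.Random(seed)
    text_emb = coarse_text_embedding(prompt)
    plan = plan_tokens(prompt)  # fixed before denoising starts
    outputs = {}
    for modality in ("video", "audio"):
        noise = [rng.gauss(0, 1) for _ in range(8)]
        condition = build_condition(text_emb, plan, modality)
        outputs[modality] = denoise(noise, condition)
    return outputs, plan


if __name__ == "__main__":
    outputs, plan = generate("a dog barks twice")
    print(len(outputs["video"]), len(outputs["audio"]))
```

In this sketch the two conditions differ only in their modality-specific tail, which is one simple way to read "modality-aware planned tokens" sharing a common blueprint; the actual mechanism in the paper may be entirely different.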
