Native Audio-Visual Alignment for Generation

arXiv cs.CV, 2026-05-29. Authors: Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

Abstract

arXiv:2605.30073v1 (announce type: new). Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process.
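
The two-stage ordering described in the abstract (audio-video interaction first, context conditioning second) can be illustrated with a toy sketch. This is not NAVA's implementation: the abstract does not specify layers or shapes, so the function names `attention` and `joint_denoise_step`, the use of plain single-head attention without learned weights, and the token dimensions are all assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def joint_denoise_step(audio_tokens, video_tokens, context_tokens):
    """Hypothetical block: audio-video interaction first, context second."""
    n_audio = audio_tokens.shape[0]
    # Step 1 (assumed): audio and video tokens attend to each other in a
    # shared audio-visual space that contains no context tokens.
    av = np.concatenate([audio_tokens, video_tokens], axis=0)
    av = av + attention(av, av, av)
    # Step 2 (assumed): external context conditions the aligned stream via
    # cross-attention; the context is read from but never updated.
    av = av + attention(av, context_tokens, context_tokens)
    return av[:n_audio], av[n_audio:]

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))    # 8 audio tokens, width 16 (arbitrary)
video = rng.normal(size=(12, 16))   # 12 video tokens
context = rng.normal(size=(5, 16))  # 5 context tokens, e.g. from a text encoder
new_audio, new_video = joint_denoise_step(audio, video, context)
print(new_audio.shape, new_video.shape)
```

In this sketch the context tokens never enter the audio-video attention step, which mirrors the separation the abstract draws between low-level synchronization and semantic conditioning. A tri-modal design of the kind the abstract contrasts against would instead concatenate all three token sets before a single attention step.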
