Inference-Time Scaling for Joint Audio-Video Generation

Source: arXiv cs.CV
Date: 2026-06-03
Authors: Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung

Abstract

arXiv:2606.03183v1 (announce type: cross)

Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking.
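The abstract does not say which search procedure or score-aggregation rule the authors use, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes a simple best-of-N selection in which each candidate is scored by several verifiers and the per-verifier ranks are averaged. The names `normalized_ranks`, `select_best`, `text_align`, and `av_sync`, and all toy scores, are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def normalized_ranks(scores: Sequence[float]) -> List[float]:
    """Map one verifier's raw scores to ranks in [0, 1] (1 = best).

    Tied scores share the average of their rank positions.
    """
    n = len(scores)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: scores[i])  # ascending
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_position = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_position / (n - 1)
        i = j + 1
    return ranks


def select_best(
    candidates: Sequence[dict],
    verifiers: Dict[str, Callable[[dict], float]],
) -> Tuple[int, List[float]]:
    """Pick the candidate with the highest mean rank across all verifiers.

    Assumption: rank averaging is used only because it is scale-free;
    the source does not specify an aggregation rule.
    """
    per_verifier = [
        normalized_ranks([fn(c) for c in candidates])
        for fn in verifiers.values()
    ]
    combined = [
        sum(r[i] for r in per_verifier) / len(per_verifier)
        for i in range(len(candidates))
    ]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return best, combined


# Toy stand-ins for N generated audio-video samples; each dict holds
# made-up verifier scores rather than real model outputs.
candidates = [
    {"text_align": 0.95, "av_sync": 0.10},  # strong on one verifier only
    {"text_align": 0.80, "av_sync": 0.85},  # balanced
    {"text_align": 0.30, "av_sync": 0.90},
    {"text_align": 0.60, "av_sync": 0.50},
]
verifiers = {
    "text_align": lambda c: c["text_align"],  # hypothetical text-alignment score
    "av_sync": lambda c: c["av_sync"],        # hypothetical audio-video sync score
}

best, combined = select_best(candidates, verifiers)
single = max(range(len(candidates)), key=lambda i: candidates[i]["text_align"])
print(best, single)  # prints "1 0"
```

In this toy setup, selecting by the text-alignment score alone picks candidate 0, which has the worst synchronization score, while the averaged ranks favor the balanced candidate 1. That is the kind of asymmetric trade-off the abstract attributes to single-objective guidance.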
