Inference-Time Scaling for Joint Audio-Video Generation

Event: PRODUCT_LAUNCH | Date: 2026-06-03 | Impact: MEDIUM

arXiv:2606.03183v1 | Announce Type: cross

Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains.