CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

arXiv cs.CV, 2026-05-26. Authors: Gyubin Lee, Junwon Lee, Juhan Nam

Abstract

arXiv:2605.18916v2

We investigate Counterfactual Video Foley Generation, the task of generating audio whose sound-source identity contradicts the visual evidence while remaining temporally synchronized with a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this task: when the video and text contents disagree, they often remain anchored to the visually implied sound source. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping the audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared with naive negative prompting and state-of-the-art baselines.
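The two-phase idea can be illustrated with a toy Euler sampler. This is only a rough sketch under stated assumptions, not the authors' implementation: the function `two_phase_sample`, the `velocity` callable and its signature, the switch point `switch_frac`, the `guidance` scale, and the extrapolation rule used in Phase 1 are all hypothetical, since the abstract does not specify how suppression is performed or when the phases switch. Phase 1 here uses a simple negative-prompt-style extrapolation as a stand-in for "suppressing the visually implied source".

```python
import numpy as np

def two_phase_sample(velocity, x0, target_text, implied_text, video,
                     num_steps=20, switch_frac=0.5, guidance=2.0):
    """Toy two-phase Euler sampler for a flow-matching model (illustrative only).

    velocity(x, t, text, video) is a hypothetical callable that returns the
    model's predicted velocity; video=None means no video conditioning.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    switch_step = int(round(switch_frac * num_steps))
    phases = []
    for i in range(num_steps):
        t = i * dt
        if i < switch_step:
            # Phase 1 (assumed form): keep video conditioning for timing, and
            # steer away from the visually implied source by extrapolating
            # from the implied-source branch toward the target-prompt branch.
            v_target = velocity(x, t, target_text, video)
            v_implied = velocity(x, t, implied_text, video)
            v = v_implied + guidance * (v_target - v_implied)
            phases.append(1)
        else:
            # Phase 2: drop video conditioning and follow the target prompt.
            v = velocity(x, t, target_text, None)
            phases.append(2)
        x = x + dt * v  # explicit Euler step along the flow
    return x, phases

def toy_velocity(x, t, text, video):
    """Stand-in for a pretrained VT2A model: pulls x toward the text vector,
    plus an additive video term when video conditioning is present."""
    pull = text - x
    if video is not None:
        pull = pull + video
    return pull
```

Calling `two_phase_sample(toy_velocity, np.zeros(4), np.ones(4), -np.ones(4), np.full(4, 0.5))` runs the first half of the steps with video conditioning and the second half without it. In a real system the `velocity` callable would wrap a pretrained flow-matching VT2A model, and the switch point and guidance rule would follow the paper rather than these placeholders.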
