AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation 文章

ArXiv CS.CV2026-05-29NEWSen作者: Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Jingjin Zhu, Manyuan Zhang, Ser-Nam Lim, Harry Yang

摘要

arXiv:2512.01334v2 Announce Type: replace Abstract: Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (\textit{e.g., object addition, removal, or modification}). Empirically, our analysis reveals that this stems from \textbf{visual dominance}, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose \textbf{AlignVid}, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (\textbf{ASM}) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (\textbf{GS}) to maintain generation stability.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据