Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

ArXiv CS.CV | 2026-06-02 | NEWS | en | Authors: Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua

Details

Source site: ArXiv CS.CV
Authors: Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua
Article type: NEWS
Language: en
Published: 2026-06-02

Abstract

arXiv:2606.01615v1 (announce type: new)

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches rely on static cross-attention or prompt-tuning mechanisms and fail to adaptively model the evolving relationships between modalities, which leads to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to those in biological systems.
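The abstract gives no equations, so the following is only a toy illustration of the general reaction-diffusion idea it describes, not the authors' RDMF implementation. It assumes a discrete Laplacian along the time axis for the diffusion term and a sigmoid-gated term for the text-video "reaction". The function name `rd_step` and the parameters `diffusion`, `rate`, and `dt` are hypothetical choices made for this sketch.

```python
import math


def rd_step(video, text, diffusion=0.2, rate=0.5, dt=1.0):
    """One explicit Euler step of a toy reaction-diffusion update.

    video: list of T frame feature vectors (lists of floats, equal length)
    text:  one query feature vector of the same length
    All names and defaults here are illustrative assumptions.
    """
    num_frames = len(video)
    updated = []
    for t in range(num_frames):
        # Replicate the edge frames so nothing flows out of the clip.
        prev_frame = video[max(t - 1, 0)]
        next_frame = video[min(t + 1, num_frames - 1)]
        frame = video[t]

        # Relevance of this frame to the query, squashed into (0, 1).
        score = sum(f * q for f, q in zip(frame, text))
        gate = 1.0 / (1.0 + math.exp(-score))

        new_frame = []
        for p, c, n in zip(prev_frame, frame, next_frame):
            laplacian = p - 2.0 * c + n  # diffusion: smooth along time
            # Reaction: grow the feature when gate > 0.5, shrink it otherwise.
            reaction = rate * (2.0 * gate - 1.0) * c
            new_frame.append(c + dt * (diffusion * laplacian + reaction))
        updated.append(new_frame)
    return updated


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
    query = [1.0, 0.0]
    for row in rd_step(frames, query):
        print(row)
```

In this sketch, frames whose features align with the query are amplified and anti-aligned frames are damped, while the Laplacian term spreads information to neighbouring time steps. A learned model would presumably replace the fixed sigmoid gate and scalar coefficients with trainable components.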
