Optimal multimodal fusion for multimedia data analysis (Paper)

2004 · 226 citations

Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing

Abstract

Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
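
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes scikit-learn is available, uses `FastICA` as a stand-in for finding independent components, splits those components evenly into groups as a placeholder for modality discovery, and uses an RBF-kernel `SVC` trained on per-modality scores as a stand-in for super-kernel fusion. The function names `fit_two_step` and `predict_two_step` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC


def fit_two_step(X, y, n_components=6, n_modalities=3, seed=0):
    """Illustrative two-step fusion (an assumption-laden sketch, not the paper's method)."""
    # Step 1: map raw features to approximately independent components,
    # then split them evenly into groups that stand in for "modalities".
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    S = ica.fit_transform(X)
    groups = np.array_split(np.arange(n_components), n_modalities)

    # Step 2a: train one classifier per modality.
    base = [SVC(kernel="rbf", gamma="scale").fit(S[:, g], y) for g in groups]

    # Step 2b: train a nonlinear fusion classifier on the per-modality scores.
    # Simplification: the scores come from the training data itself;
    # held-out scores would be a safer choice in practice.
    scores = np.column_stack(
        [m.decision_function(S[:, g]) for m, g in zip(base, groups)]
    )
    fuser = SVC(kernel="rbf", gamma="scale").fit(scores, y)
    return ica, groups, base, fuser


def predict_two_step(model, X):
    """Apply the fitted transform, per-modality classifiers, and fuser to new data."""
    ica, groups, base, fuser = model
    S = ica.transform(X)
    scores = np.column_stack(
        [m.decision_function(S[:, g]) for m, g in zip(base, groups)]
    )
    return fuser.predict(scores)
```

The three design factors named in the abstract map loosely onto the knobs here: `n_components` and the grouping affect modality independence and per-modality dimensionality, while the choice of fusion classifier sets the fusion-model complexity.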

Authors

John R. Smith

Kevin Chen–Chuan Chang

Edward Yi Chang
