Optimal multimodal fusion for multimedia data analysis (paper)

2004, cited by 226
Music and Audio Processing; Video Analysis and Summarization; Speech and Audio Processing

Abstract

Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
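
The two-step idea in the abstract can be illustrated with a toy sketch. It is not the paper's method, and every component is an assumed stand-in: absolute Pearson correlation replaces the statistical-independence analysis, nearest-centroid scorers replace the per-modality classifiers, and a logistic regression over their scores replaces super-kernel fusion. All function names, the 0.8 threshold, and the synthetic data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def group_features(columns, threshold=0.8):
    """Step 1 (assumed proxy): greedily put strongly correlated features
    into the same modality so that different modalities overlap less."""
    groups = []
    for j, col in enumerate(columns):
        for g in groups:
            if any(abs(pearson(col, columns[k])) >= threshold for k in g):
                g.append(j)
                break
        else:  # no existing group matched, so start a new modality
            groups.append([j])
    return groups


def centroid_scorer(rows, labels, idx):
    """Per-modality classifier (assumed): gap between squared distances
    to the negative and positive class centroids, using features idx."""
    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(r[j] for r in members) / len(members) for j in idx]

    pos, neg = centroid(1), centroid(0)

    def score(row):
        d_pos = sum((row[j] - p) ** 2 for j, p in zip(idx, pos))
        d_neg = sum((row[j] - q) ** 2 for j, q in zip(idx, neg))
        return d_neg - d_pos  # > 0 means closer to the positive centroid

    return score


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_fusion(scores, labels, lr=0.01, epochs=300):
    """Step 2 (assumed stand-in for super-kernel fusion): logistic
    regression over per-modality scores, fit by batch gradient descent."""
    dim, n = len(scores[0]), len(scores)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) - y
            for i in range(dim):
                gw[i] += err * s[i]
            gb += err
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b


# Synthetic data: features 0-1 mimic one source, features 2-3 another.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
rows = []
for y in labels:
    sign = 2 * y - 1
    v = sign + rng.gauss(0, 1)
    a = sign + rng.gauss(0, 1)
    rows.append([v, v + rng.gauss(0, 0.1), a, a + rng.gauss(0, 0.1)])

columns = [list(c) for c in zip(*rows)]
groups = group_features(columns)
scorers = [centroid_scorer(rows, labels, g) for g in groups]
scores = [[f(r) for f in scorers] for r in rows]
w, b = fit_fusion(scores, labels)
preds = [1 if sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) >= 0.5 else 0
         for s in scores]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(groups, round(accuracy, 3))
```

The sketch touches the three design factors named in the abstract only loosely: grouping reduces overlap between modalities, each scorer sees only its own few features, and the fusion model is kept deliberately simple. Accuracy here is measured on the training data, so it says nothing about generalization.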