Generating Natural-Language Video Descriptions Using Text-Mined Knowledge (paper)

2013, Proceedings of the AAAI Conference on Artificial Intelligence, cited by 241

Artificial Intelligence: Multimodal Machine Learning Applications; Human Pose and Action Recognition; Video Analysis and Summarization

Abstract

We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it with contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.
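The triplet selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything here is an assumption made for illustration: the function name `best_triplet`, the weighted sum of log scores, the `weight` and `floor` parameters, and the toy numbers are all hypothetical and are not the paper's actual method or data.

```python
import math
from itertools import product


def best_triplet(subj_scores, verb_scores, obj_scores, text_prior,
                 weight=0.5, floor=1e-6):
    """Return the highest-scoring (subject, verb, object) triplet.

    subj_scores, verb_scores, obj_scores: dicts mapping a label to a
        detector confidence in (0, 1].
    text_prior: dict mapping (subject, verb, object) to a plausibility in
        (0, 1], standing in for knowledge mined from text. Triplets that
        are missing from the dict fall back to `floor`.
    weight: hypothetical mixing weight; 0 uses vision only, 1 text only.
    """
    best, best_score = None, -math.inf
    for s, v, o in product(subj_scores, verb_scores, obj_scores):
        # Log-confidence of the visual detectors for this combination.
        vision = (math.log(subj_scores[s]) + math.log(verb_scores[v])
                  + math.log(obj_scores[o]))
        # Log-plausibility of the triplet according to the text prior.
        text = math.log(text_prior.get((s, v, o), floor))
        score = (1 - weight) * vision + weight * text
        if score > best_score:
            best, best_score = (s, v, o), score
    return best


# Toy, made-up detector outputs and prior (not from the paper).
subjects = {"person": 0.9, "dog": 0.3}
verbs = {"ride": 0.4, "eat": 0.5}
objects = {"motorbike": 0.8, "car": 0.2}
prior = {
    ("person", "ride", "motorbike"): 0.6,
    ("person", "eat", "motorbike"): 0.001,
}

vision_only = best_triplet(subjects, verbs, objects, prior, weight=0.0)
with_text = best_triplet(subjects, verbs, objects, prior, weight=0.5)
print(vision_only)  # ('person', 'eat', 'motorbike')
print(with_text)    # ('person', 'ride', 'motorbike')
```

In this toy setting the vision-only choice follows the slightly stronger "eat" detection, while the text prior shifts the selection to the more plausible "person ride motorbike", which is the kind of contextual correction the abstract describes.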

Authors (4 of 5 listed)

Sergio Guadarrama

Kate Saenko

Raymond J. Mooney

Girish Malkarnenkar
