CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

ArXiv CS.CV, 2026-05-27

Authors: Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

Abstract

arXiv:2605.26967v1. Announce type: new.

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions.
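The keyframe-plus-residual representation described in the abstract can be pictured with a small data structure. The following is only an illustrative sketch, not the authors' implementation: the class names `KeyframeCaption` and `ResidualCaption`, the timestamp fields, and the `render` helper are all hypothetical, and the abstract does not specify how captions are stored or serialized.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResidualCaption:
    """Hypothetical: a short caption covering only what changes in a time span."""
    start_s: float
    end_s: float
    text: str


@dataclass
class KeyframeCaption:
    """Hypothetical: a full description of the stable context at one timestamp."""
    time_s: float
    text: str
    residuals: List[ResidualCaption] = field(default_factory=list)


def render(captions: List[KeyframeCaption]) -> str:
    """Flatten keyframe and residual captions into one time-ordered text."""
    lines = []
    for kf in sorted(captions, key=lambda k: k.time_s):
        lines.append(f"[keyframe @ {kf.time_s:.1f}s] {kf.text}")
        # Residuals state only the local change, so the stable context
        # written once in the keyframe caption is not repeated.
        for r in sorted(kf.residuals, key=lambda r: r.start_s):
            lines.append(f"  [residual {r.start_s:.1f}-{r.end_s:.1f}s] {r.text}")
    return "\n".join(lines)


# Made-up example content, for illustration only.
example = KeyframeCaption(
    time_s=0.0,
    text="A kitchen with a wooden counter; a person in a red apron stands at the stove.",
    residuals=[
        ResidualCaption(2.0, 3.5, "The person pours oil into the pan."),
        ResidualCaption(1.0, 1.5, "The person picks up a bottle."),
    ],
)
print(render([example]))
```

The design point this mirrors is the codec analogy: the scene description is written once, and each later entry records only the difference, much as a video codec stores a full frame followed by deltas.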
