CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

Event: PRODUCT_LAUNCH | Date: 2026-06-01 | Impact: MEDIUM

arXiv:2510.14904v3
Announce Type: replace

Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training stra