Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

ArXiv CS.CV, 2026-05-27. Authors: Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

Abstract

arXiv:2605.26933v1 (announce type: new). Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand a fine-grained understanding of the semantic and structural information within video frames. Text-to-image diffusion models are well known for generating images that accurately reflect the semantics and structures described in the input prompt, which demonstrates a strong grasp of both. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models.