ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

arXiv cs.CV, 2026-06-02
Authors: Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin

Abstract

arXiv:2606.01825v1 (announce type: new)

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding because of the global representational bias and semantic sparsity they inherit from training on short captions. The result is weak fine-grained alignment, a problem exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that removes the reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs, providing scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment.
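
The abstract gives no formulas, so the following is only a rough sketch of how the two ideas it names could fit together, not the authors' implementation. It assumes region and sentence embeddings are already available as NumPy arrays, mines pseudo pairs by best cosine match above a threshold (an assumed rule standing in for RSM), and adds a local InfoNCE term over the mined pairs to a global InfoNCE term. The function names, the `threshold`, `temperature`, and `local_weight` parameters, and the choice of InfoNCE for both terms are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mine_region_sentence_pairs(region_emb, sentence_emb, threshold=0.5):
    """Assumed stand-in for RSM: match each sentence to its most similar region.

    Returns a list of (region_index, sentence_index) pairs whose cosine
    similarity is at least `threshold`.
    """
    sim = l2_normalize(sentence_emb) @ l2_normalize(region_emb).T  # (S, R)
    best_region = sim.argmax(axis=1)
    best_score = sim[np.arange(sim.shape[0]), best_region]
    return [
        (int(r), s)
        for s, r in enumerate(best_region)
        if best_score[s] >= threshold
    ]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss where row i of `a` matches row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def multi_granular_loss(img_global, txt_global, region_emb, sentence_emb,
                        pairs, local_weight=1.0):
    """Global contrastive term plus a weighted local term over mined pairs."""
    global_loss = info_nce(img_global, txt_global)
    if not pairs:
        return global_loss  # no pseudo pairs mined: fall back to global only
    r_idx = [r for r, _ in pairs]
    s_idx = [s for _, s in pairs]
    local_loss = info_nce(region_emb[r_idx], sentence_emb[s_idx])
    return global_loss + local_weight * local_loss


if __name__ == "__main__":
    regions = np.eye(3, 4)                       # three toy region embeddings
    sentences = np.array([[0.1, 0.0, 1.0, 0.0],  # closest to region 2
                          [1.0, 0.1, 0.0, 0.0]]) # closest to region 0
    pairs = mine_region_sentence_pairs(regions, sentences)
    img, txt = np.eye(2, 4), np.eye(2, 4)        # toy global embeddings
    print(pairs, multi_granular_loss(img, txt, regions, sentences, pairs))
```

In this sketch the threshold controls how noisy the pseudo supervision is allowed to be: raising it yields fewer but cleaner pairs, and when no pair survives the loss reduces to the global term alone.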