Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

ArXiv CS.CV | 2026-06-16 | NEWS | en | Authors: Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

Details

Source site: ArXiv CS.CV
Authors: Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2606.15134v1 (announce type: new)

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the two embeddings apart or pulls them together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision.
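The contrast drawn in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the authors' implementation: `contrastive_pair_loss` is a standard margin-based pair loss standing in for "scalar" class-label supervision, `correctness_rewards` assumes a binary reward of 1 for a correct same-class prediction and 0 otherwise, and `group_relative_advantages` uses the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation). All function names are hypothetical.

```python
from statistics import mean, pstdev


def contrastive_pair_loss(distance, same_class, margin=1.0):
    """Margin-based pair loss: the class label enters only as one bit.

    Every embedding dimension is pulled together or pushed apart by the
    same scalar signal, regardless of which attributes actually match.
    """
    if same_class:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2


def correctness_rewards(predictions, same_class):
    """Assumed binary reward: 1.0 when a sampled prediction is correct."""
    return [1.0 if p == same_class else 0.0 for p in predictions]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled MLLM answers to "do these two images share a class?"
# for a pair that truly does share a class (hypothetical data).
sampled_predictions = [True, True, False, True]
rewards = correctness_rewards(sampled_predictions, same_class=True)
advantages = group_relative_advantages(rewards)
```

In this sketch the incorrect answer receives a negative advantage and the correct ones a positive advantage. How that signal is propagated back to the vision encoder's tokens is described only at a high level in the abstract and is not modeled here.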
