IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

ArXiv cs.CV, 2026-06-01
Authors: Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

Abstract

arXiv:2603.19862v2

Vision-Language Models such as CLIP are widely used for inter-modal tasks, which involve both the visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks such as image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP, focusing on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there are two operators at work: an inter-modal operator responsible for aligning the two modalities during training, and an intra-modal operator that only enforces intra-modal normalization and does nothing to promote intra-modal alignment.
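The decomposition the abstract alludes to can be illustrated with a small numerical check. The sketch below is not the authors' code; it assumes linear, bias-free projectors (hypothetical names `W_img` and `W_txt`) and reads the "inter-modal operator" as `W_img^T W_txt` and the "intra-modal operators" as `W_img^T W_img` and `W_txt^T W_txt`, which is simply what falls out of expanding the cosine similarity of projected features. Under that reading, the cross-modal product appears only in the numerator, while the intra-modal products appear only in the normalizing denominators.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gram(A, B):
    """Return A^T B for matrices stored as lists of rows."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    return [[dot(ca, cb) for cb in cols_b] for ca in cols_a]

def bilinear(x, M, y):
    """Return x^T M y."""
    return dot(x, matvec(M, y))

def cosine_projected(W_img, W_txt, x, y):
    """Cosine similarity computed the usual way, after projection."""
    u, v = matvec(W_img, x), matvec(W_txt, y)
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def cosine_via_operators(W_img, W_txt, x, y):
    """Same quantity, written in terms of the three operators."""
    inter = gram(W_img, W_txt)       # assumed inter-modal operator
    intra_img = gram(W_img, W_img)   # assumed image intra-modal operator
    intra_txt = gram(W_txt, W_txt)   # assumed text intra-modal operator
    numerator = bilinear(x, inter, y)
    denominator = (math.sqrt(bilinear(x, intra_img, x))
                   * math.sqrt(bilinear(y, intra_txt, y)))
    return numerator / denominator

# Toy sizes and random values, chosen only for illustration.
rng = random.Random(0)
d, d_v, d_t = 4, 6, 5  # shared, image and text feature dimensions
W_img = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(d)]
W_txt = [[rng.gauss(0, 1) for _ in range(d_t)] for _ in range(d)]
x = [rng.gauss(0, 1) for _ in range(d_v)]  # pre-projection image feature
y = [rng.gauss(0, 1) for _ in range(d_t)]  # pre-projection text feature

a = cosine_projected(W_img, W_txt, x, y)
b = cosine_via_operators(W_img, W_txt, x, y)
print(abs(a - b) < 1e-9)
```

In this formulation, an image-to-image similarity between two pre-projection features involves only `W_img^T W_img`, the operator that the image-text cosine uses purely for normalization. That is consistent with the abstract's claim that the intra-modal operator is not driven toward intra-modal alignment.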
