CLIP-like Model as a Foundational Density Ratio Estimator

ArXiv CS.CV | 2026-06-02 | Authors: Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo


Abstract

arXiv:2506.22881v3

Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between the joint and marginal image-text distributions, so the similarity scores they learn are proportional to those log density ratios. However, prior work has largely focused on the utility of their embeddings, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained, general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities.
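The link between contrastive training and log density ratios can be checked on a toy problem. The sketch below is not the paper's method. It assumes a made-up 2x2 joint distribution over (image, text) pairs, one free logit per pair instead of encoder embeddings, and a sigmoid-style binary loss in which positives come from the joint and negatives from the product of marginals with equal weight. Under those assumptions the minimizer of the expected loss is exactly log p(x, y) / (p(x) p(y)). The names `joint`, `fit_logits`, and `mutual_information` are hypothetical.

```python
import math

# Toy joint distribution p(image, text) over 2 images x 2 texts (made-up numbers).
joint = [[0.30, 0.10],
         [0.05, 0.55]]

# Marginals p(image) and p(text).
p_img = [sum(row) for row in joint]
p_txt = [sum(joint[i][j] for i in range(2)) for j in range(2)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logits(joint, p_img, p_txt, steps=500, lr=4.0):
    """Gradient descent on the expected binary contrastive loss.

    Assumption: positives ~ joint, negatives ~ product of marginals,
    equally weighted, with one free logit per (image, text) pair.
    """
    logits = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        for i in range(2):
            for j in range(2):
                p = joint[i][j]            # mass of the pair as a positive
                q = p_img[i] * p_txt[j]    # mass of the pair as a negative
                # d/df of -p*log(sigmoid(f)) - q*log(1 - sigmoid(f))
                grad = sigmoid(logits[i][j]) * (p + q) - p
                logits[i][j] -= lr * grad
    return logits


logits = fit_logits(joint, p_img, p_txt)

# Exact log density ratio log p(x, y) / (p(x) p(y)) for comparison.
log_ratio = [[math.log(joint[i][j] / (p_img[i] * p_txt[j])) for j in range(2)]
             for i in range(2)]

# Divergence estimation: mutual information is the expected log ratio under the joint.
mutual_information = sum(joint[i][j] * logits[i][j]
                         for i in range(2) for j in range(2))

for i in range(2):
    for j in range(2):
        print(f"pair ({i},{j}): learned {logits[i][j]:+.4f}  exact {log_ratio[i][j]:+.4f}")
print(f"mutual information estimate: {mutual_information:.4f}")
```

In a real CLIP or SigLIP model the logits come from scaled embedding similarities and the negatives from other items in the batch, so the learned score may match the log ratio only up to scale and offset terms. The toy recovers it exactly only because each pair has its own unconstrained logit.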
