Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

ArXiv CS.CV | 2026-05-27 | NEWS | en | Authors: Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

Details

Source site: ArXiv CS.CV
Authors: Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa
Article type: NEWS
Language: en
Publication date: 2026-05-27

Abstract

arXiv:2601.12809v2 (announce type: replace)

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting.
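To make the setup in the abstract concrete, the following is a minimal sketch of a 1D two-object scene paired with a left-right caption, plus a CLIP-style symmetric contrastive loss. It is not the authors' implementation: the names `make_scene` and `clip_loss`, the strip width, the caption template, and the temperature of 0.07 are all assumptions made for illustration, and the encoders are replaced by plain embedding vectors.

```python
import math
import random


def make_scene(labels, width=8, rng=random):
    """Hypothetical 1D scene: two distinct labels at two distinct positions."""
    a, b = rng.sample(labels, 2)
    i, j = sorted(rng.sample(range(width), 2))
    strip = [None] * width
    strip[i], strip[j] = a, b  # a is placed to the left of b
    caption = f"{a} is left of {b}"  # assumed caption template
    return strip, caption


def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    Row k of image_emb and row k of text_emb form the positive pair;
    every other combination in the batch acts as a negative.
    """
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    logits = [[sum(p * q for p, q in zip(u, v)) / temperature for v in txt]
              for u in img]
    n = len(logits)
    # Image-to-text direction: softmax along each row.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax along each column.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

In this sketch, swapping the two objects in a scene changes which caption is correct, so a model that ignores order cannot drive the loss down on mirrored pairs. Holding out some label pairs from `labels` during training would mimic the unseen-pair evaluation the abstract describes, under the same assumptions.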
