ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

ArXiv CS.CV | 2026-06-03 | News | Authors: Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

Abstract

arXiv:2411.15851v2 (announce type: replace)

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute this deficiency to the self-attention layers in the final block, and have achieved commendable results by replacing the original query-key attention with self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the properties of cross-correlation (query-key) attention, which captures rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties.
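To make the terminology concrete, the following is a minimal NumPy sketch of the three attention maps the abstract contrasts: cross-correlation (query-key) and the two self-correlation variants (query-query and key-key). It is an illustration only, not the ResCLIP method or CLIP's actual code: the single-head setup, the random projection weights, and the names `attention_maps`, `w_q`, and `w_k` are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_maps(x, w_q, w_k):
    """Return (query-key, query-query, key-key) attention maps.

    x   : (n_tokens, dim) token features (hypothetical patch embeddings)
    w_q : (dim, dim) query projection (random here, learned in a real model)
    w_k : (dim, dim) key projection (random here, learned in a real model)
    """
    q = x @ w_q
    k = x @ w_k
    scale = 1.0 / np.sqrt(q.shape[-1])
    cross = softmax(q @ k.T * scale)  # cross-correlation: query-key
    qq = softmax(q @ q.T * scale)     # self-correlation: query-query
    kk = softmax(k @ k.T * scale)     # self-correlation: key-key
    return cross, qq, kk


# Toy input: 6 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.standard_normal((n_tokens, dim))
w_q = rng.standard_normal((dim, dim))
w_k = rng.standard_normal((dim, dim))

cross, qq, kk = attention_maps(x, w_q, w_k)
print(cross.shape, qq.shape, kk.shape)
```

Each map has one row per token, and each row is a distribution over all tokens. The only difference between the variants is which projections are compared: a token's query against every key, or queries against queries and keys against keys.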
