摘要
arXiv:2602.09431v2 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) have achieved impressive performance across multimodal tasks, but their reliance on visual inputs exposes them to adversarial threats. Encoder-based attacks provide an efficient alternative to end-to-end optimization by crafting perturbations through the vision encoder alone. However, existing encoder-based attacks often assume that the surrogate encoder is identical or similar to the victim LVLM's vision encoder. In this work, we present a systematic study of their transferability in more realistic black-box deployments with heterogeneous LVLM architectures. We find that model-specific visual evidence is inconsistent across models, whereas text-conditioned grounding regions are more closely tied to caption-relevant evidence and provide a more stable transfer target. However, existing attacks remain weakly aligned with and insufficiently disrupt these regions.
相关事件查看全部 (1)
相关公司查看全部 (3)
相关人物
暂无数据