Grounding-Driven Attack: Improving Encoder-based Adversarial Transferability against Large Vision-Language Models

ArXiv CS.CV | 2026-05-26 | Authors: Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, Haibo Hu

Abstract

arXiv:2602.09431v2

Large vision-language models (LVLMs) have achieved impressive performance across multimodal tasks, but their reliance on visual inputs exposes them to adversarial threats. Encoder-based attacks provide an efficient alternative to end-to-end optimization by crafting perturbations through the vision encoder alone. However, existing encoder-based attacks often assume that the surrogate encoder is identical or similar to the victim LVLM's vision encoder. In this work, we present a systematic study of the transferability of these attacks in more realistic black-box deployments with heterogeneous LVLM architectures. We find that model-specific visual evidence is inconsistent across models, whereas text-conditioned grounding regions are more closely tied to caption-relevant evidence and provide a more stable transfer target. Yet existing attacks remain weakly aligned with these regions and do not disrupt them sufficiently.
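The abstract does not spell out the attack itself. As a rough illustration of what "crafting perturbations through the vision encoder alone" can look like, here is a minimal sketch of a generic feature-space PGD attack whose perturbation is confined to a region mask. It is not the authors' method: the one-layer `tanh` encoder, the weights `W`, the mask layout, and the names `encode` and `masked_pgd` are all assumptions made for illustration.

```python
import numpy as np


def encode(W, x):
    """Toy stand-in for a surrogate vision encoder (assumption: one tanh layer)."""
    return np.tanh(W @ x)


def masked_pgd(W, x, mask, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Push encoder features away from the clean features, editing only masked pixels."""
    rng = np.random.default_rng(seed)
    clean = encode(W, x)
    # Random start inside the L_inf ball; at delta = 0 the gradient below is zero.
    delta = rng.uniform(-eps, eps, size=x.shape) * mask
    for _ in range(iters):
        feat = encode(W, x + delta)
        # Analytic gradient of 0.5 * ||feat - clean||^2 with respect to delta.
        grad = W.T @ ((feat - clean) * (1.0 - feat ** 2))
        # Signed ascent step, restricted to the masked region.
        delta = delta + step * np.sign(grad) * mask
        # Project back onto the L_inf ball, then keep pixels in [0, 1].
        delta = np.clip(delta, -eps, eps)
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return delta


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)) / 8.0  # hypothetical encoder weights
x = rng.uniform(0.0, 1.0, size=64)   # flattened 8x8 "image" with values in [0, 1]
mask = np.zeros(64)
mask[:16] = 1.0                      # hypothetical grounding region: first two rows
delta = masked_pgd(W, x, mask)
drift = np.linalg.norm(encode(W, x + delta) - encode(W, x))
```

In a real setting the mask would presumably come from a text-conditioned grounding model and the encoder would be a pretrained vision backbone differentiated with an autodiff framework; both are replaced by toy stand-ins here so the sketch runs with NumPy alone.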