Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
Event: PRODUCT_LAUNCH, 2026-05-27. Impact: MEDIUM
arXiv:2605.27243v1. Announce Type: new.
Abstract: Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does n