Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

arXiv cs.CV | 2026-05-26 | Authors: Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

Abstract

arXiv:2605.01284v2

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where text-level citations are too vague to pinpoint evidence, leaving users to locate it manually within lengthy documents; and (2) visual semantic loss, where converting visually rich documents (e.g., slides, PDFs with charts) into text discards the spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates.
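To make the idea of an iterative retrieve-and-reason loop with region-level attribution concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each piece of evidence is recorded as a pixel bounding box on a document screenshot, that the retriever is any callable returning candidate screenshot identifiers, and that the Vision-Language Model is wrapped in a callable that returns evidence, an optional follow-up query, and an optional answer. The names `Evidence`, `Step`, `chain_of_evidence`, `toy_retrieve`, and `toy_reason` are hypothetical, and the toy retriever and reasoner are hard-coded stand-ins for real components.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed convention: (left, top, right, bottom) in screenshot pixel coordinates.
BBox = Tuple[int, int, int, int]


@dataclass
class Evidence:
    doc_id: str  # identifier of the screenshot the region was found on
    bbox: BBox   # region that supports the reasoning step
    note: str    # short description of what the region shows


@dataclass
class Step:
    evidence: List[Evidence]
    next_query: Optional[str]  # follow-up query for the next hop, if any
    answer: Optional[str]      # final answer once the chain is complete


# Hypothetical interfaces: any retriever or reasoner matching these fits the loop.
Retriever = Callable[[str], List[str]]
Reasoner = Callable[[str, List[str], List[Evidence]], Step]


def chain_of_evidence(
    question: str,
    retrieve: Retriever,
    reason: Reasoner,
    max_hops: int = 4,
) -> Tuple[Optional[str], List[Evidence]]:
    """Alternate retrieval and reasoning, accumulating attributed regions."""
    chain: List[Evidence] = []
    query = question
    for _ in range(max_hops):
        candidates = retrieve(query)
        step = reason(question, candidates, chain)
        chain.extend(step.evidence)
        if step.answer is not None:
            return step.answer, chain
        if step.next_query is None:
            break  # the reasoner has neither an answer nor a follow-up query
        query = step.next_query
    return None, chain


def toy_retrieve(query: str) -> List[str]:
    # Stand-in for a real retriever: a fixed query-to-screenshot lookup.
    index = {
        "Where was the founder of ACME born?": ["slide_1"],
        "Where was Jane Roe born?": ["slide_2"],
    }
    return index.get(query, [])


def toy_reason(question: str, candidates: List[str], chain: List[Evidence]) -> Step:
    # Stand-in for a Vision-Language Model call: hard-coded regions and answers.
    if "slide_1" in candidates:
        found = Evidence("slide_1", (40, 120, 380, 160), "founder: Jane Roe")
        return Step([found], "Where was Jane Roe born?", None)
    if "slide_2" in candidates:
        found = Evidence("slide_2", (10, 200, 300, 240), "birthplace: Oslo")
        return Step([found], None, "Oslo")
    return Step([], None, None)


answer, chain = chain_of_evidence(
    "Where was the founder of ACME born?", toy_retrieve, toy_reason
)
print(answer)
for item in chain:
    print(item.doc_id, item.bbox, item.note)
```

Because the loop depends only on the two callables, swapping in a different retriever leaves the attribution logic untouched, which is one way to read the "retriever-agnostic" property described in the abstract.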