Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence 文章

ArXiv CS.CV2026-06-01NEWSen作者: Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

查看原文 →

关系图谱

摘要

arXiv:2605.30698v1 Announce Type: new Abstract: Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus.

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (4)