EvA: An Evidence-First Audio Understanding Paradigm for LALMs

ArXiv CS.AI | 2026-05-29 | NEWS | en | Authors: Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang

Details

Source site: ArXiv CS.AI
Authors: Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang
Article type: NEWS
Language: en
Publication date: 2026-05-29

Original text

Abstract

arXiv:2603.27667v2 (announce type: replace-cross)

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck: state-of-the-art systems show larger deficits in acoustic evidence extraction than in downstream reasoning, suggesting that upstream perception is often the limiting factor. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that enhances acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. We also build EvA-Perception, a large-scale training set with about 54K event-ordered captions and 500K evidence-grounded QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception results on MMAU, MMAR, and MMSU, with the largest gains on perception-heavy splits.
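
The abstract does not spell out how the two paths are combined, so the following is only an illustrative sketch of what "non-compressive, time-aligned fusion" could look like, not the authors' implementation. It assumes two hypothetical feature streams at different frame rates (`fine` and `coarse`), aligns the coarse stream to the fine stream's time axis by repetition, and concatenates features frame by frame so that no time steps are dropped. The function names and the nearest-frame alignment rule are assumptions made for illustration; hierarchical aggregation is not shown.

```python
from typing import List

Vector = List[float]


def align_to_frames(coarse: List[Vector], n_frames: int) -> List[Vector]:
    """Repeat coarse-rate vectors so there is exactly one per fine-rate frame.

    Assumption: both streams cover the same audio duration, so fine frame t
    maps to the coarse frame that spans the same relative position in time.
    """
    if not coarse or n_frames <= 0:
        raise ValueError("both streams must contain at least one frame")
    aligned = []
    for t in range(n_frames):
        src = min(t * len(coarse) // n_frames, len(coarse) - 1)
        aligned.append(coarse[src])
    return aligned


def fuse_time_aligned(fine: List[Vector], coarse: List[Vector]) -> List[Vector]:
    """Concatenate the two streams per frame, keeping every fine-rate frame.

    "Non-compressive" here means the output has as many frames as `fine`;
    nothing is pooled or subsampled along the time axis.
    """
    aligned = align_to_frames(coarse, len(fine))
    return [f + c for f, c in zip(fine, aligned)]


# Hypothetical toy input: 4 fine-rate frames and 2 coarse-rate frames.
fine = [[1.0], [2.0], [3.0], [4.0]]
coarse = [[10.0], [20.0]]
fused = fuse_time_aligned(fine, coarse)
print(len(fused), fused)
```

In this toy setup the fused sequence keeps all four fine-rate frames, and each frame carries the coarse feature that covers the same stretch of audio. A real system would operate on encoder tensors and likely use learned projections rather than plain concatenation.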
