Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

ArXiv CS.AI, 2026-05-26. Authors: Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

Abstract

arXiv:2601.22984v2 (announce type: replace)

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench.
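To make the idea of process-aware auditing concrete, here is a minimal Python sketch of how a trajectory might be stored as atomic units and tallied by PING type. Only the four category names and the plan/search/summarize stages come from the abstract. The class names, field names, and `hallucination_profile` function are hypothetical, and the labels in the toy example are arbitrary placeholders rather than judgments the paper's framework would make.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class PingType(Enum):
    # Category names are taken from the abstract; their precise
    # definitions are not given there, so none are encoded here.
    PROPAGATION = "Propagation"
    INTENT = "Intent"
    NOISE_INDUCED = "Noise-induced"
    GROUNDING = "Grounding"


@dataclass
class AuditUnit:
    """One atomic unit of a trajectory (hypothetical schema)."""
    stage: str  # "plan", "search", or "summarize"
    kind: str   # "action", "claim", or "sub_query"
    text: str
    label: Optional[PingType] = None  # None means nothing was flagged


@dataclass
class Trajectory:
    task: str
    units: list[AuditUnit] = field(default_factory=list)


def hallucination_profile(trajectory: Trajectory) -> dict[str, int]:
    """Count flagged units per PING type across the whole trajectory."""
    counts = Counter(u.label for u in trajectory.units if u.label is not None)
    return {t.value: counts.get(t, 0) for t in PingType}


# Toy trajectory with placeholder labels (assumed data, not from the paper).
example = Trajectory(
    task="toy task",
    units=[
        AuditUnit("plan", "action", "split the question into sub-queries"),
        AuditUnit("search", "sub_query", "toy query", PingType.INTENT),
        AuditUnit("summarize", "claim", "toy claim", PingType.GROUNDING),
        AuditUnit("summarize", "claim", "another toy claim", PingType.GROUNDING),
    ],
)
profile = hallucination_profile(example)
```

Keeping the label on each unit, rather than on the final report, is what lets a per-stage breakdown be computed instead of a single end-to-end score.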
