Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

ArXiv CS.AI | 2026-06-09 | Authors: Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

Abstract

arXiv:2606.09748v1 (announce type: new). Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement;
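To make finding (i) concrete, here is a minimal sketch of how incorporation and regression on rubric criteria could be tallied between two revision turns. It is an illustration, not the paper's method: the function names, the dictionary representation of rubric results, and the choice of denominators (previously unmet criteria for incorporation, previously met criteria for regression) are all assumptions, since the abstract does not define them.

```python
from typing import Dict, Tuple


def transition_rates(before: Dict[str, bool], after: Dict[str, bool]) -> Tuple[float, float]:
    """Return (incorporation_rate, regression_rate) between two turns.

    Assumed definitions (not taken from the paper):
      incorporation = previously unmet criteria that become met
      regression    = previously met criteria that become unmet
    """
    shared = before.keys() & after.keys()
    unmet = [c for c in shared if not before[c]]
    met = [c for c in shared if before[c]]
    incorporated = sum(1 for c in unmet if after[c])
    regressed = sum(1 for c in met if not after[c])
    inc_rate = incorporated / len(unmet) if unmet else 0.0
    reg_rate = regressed / len(met) if met else 0.0
    return inc_rate, reg_rate


def net_change(before: Dict[str, bool], after: Dict[str, bool]) -> int:
    """Net change in the number of satisfied criteria across the two turns."""
    shared = before.keys() & after.keys()
    return sum(after[c] for c in shared) - sum(before[c] for c in shared)


# Hypothetical rubric results for one report before and after a revision turn.
turn_1 = {"c1": True, "c2": True, "c3": False, "c4": False}
turn_2 = {"c1": True, "c2": False, "c3": True, "c4": False}

print(transition_rates(turn_1, turn_2))
print(net_change(turn_1, turn_2))
```

In this made-up example one criterion is gained and one is lost, so the number of satisfied criteria does not change even though the report was revised. That is the kind of pattern the abstract describes for self-reflection.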
