Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

ArXiv CS.AI, 2026-06-01
Authors: Tingting Chen, Beibei Lin, Srinivas Anumasa, Vedant Shah, Zifeng Yuan, Qiran Zou, Anirudh Goyal, Dianbo Liu

Abstract

arXiv:2502.15224v2

Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in noisy, open-ended scientific environments, it is useful to isolate this prerequisite capability under controlled conditions. We introduce Auto-Discovery-Bench, a deterministic oracle-guided diagnostic benchmark in which agents recover hidden structures through repeated hypothesis–intervention–feedback cycles. The benchmark instantiates three controlled discovery abstractions: directed graph discovery, undirected relational discovery, and symbolic equation discovery. Across models, performance degrades as the number of variables, trajectory length, and distractors increase.
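To make the hypothesis–intervention–feedback cycle concrete, the following is a minimal Python sketch of the directed graph setting. It is not the benchmark's actual interface: `EdgeOracle`, `test_edge`, `discover_graph`, and the single-edge query format are hypothetical names and simplifications chosen for illustration. The sketch only shows the kind of structured state (confirmed, ruled-out, and still-unknown edges) an agent would need to keep consistent across rounds.

```python
from itertools import permutations


class EdgeOracle:
    """Hypothetical deterministic oracle that hides a directed graph."""

    def __init__(self, hidden_edges):
        self._hidden = frozenset(hidden_edges)
        self.queries = 0  # number of interventions answered so far

    def test_edge(self, source, target):
        """Answer whether the hidden graph contains the edge source -> target."""
        self.queries += 1
        return (source, target) in self._hidden


def discover_graph(nodes, oracle, max_rounds):
    """Run up to max_rounds query rounds and return the final belief state."""
    confirmed, ruled_out = set(), set()
    # Every ordered pair of distinct nodes starts out as an open hypothesis.
    unknown = list(permutations(nodes, 2))
    for _ in range(max_rounds):
        if not unknown:
            break  # nothing left to test
        source, target = unknown.pop(0)        # pick the next hypothesis
        if oracle.test_edge(source, target):   # intervention and feedback
            confirmed.add((source, target))
        else:
            ruled_out.add((source, target))
    return confirmed, ruled_out, unknown


nodes = ["A", "B", "C"]
oracle = EdgeOracle({("A", "B"), ("B", "C")})
confirmed, ruled_out, unknown = discover_graph(nodes, oracle, max_rounds=10)
print(sorted(confirmed), oracle.queries)
```

With a round budget smaller than the number of candidate edges, `unknown` stays non-empty, which is one simple way to see why the amount of state to track grows as variables are added.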