摘要
arXiv:2502.15224v2 Announce Type: replace-cross Abstract: Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in noisy, open-ended scientific environments, it is useful to isolate this prerequisite capability under controlled conditions. We introduce Auto-Discovery-Bench, a deterministic oracle-guided diagnostic benchmark in which agents recover hidden structures through repeated hypothesis--intervention--feedback cycles. The benchmark instantiates three controlled discovery abstractions: directed graph discovery, undirected relational discovery, and symbolic equation discovery. Across models, performance degrades as the number of variables, trajectory length, and distractors increase.
相关事件查看全部 (1)
相关公司
暂无数据
相关人物
暂无数据