When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Event: PRODUCT_LAUNCH | 2026-06-06 | Impact: MEDIUM
arXiv:2606.05806v1 (Announce Type: new)
Abstract: Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological