When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents (Event)

Name: When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Start: 2026-06-06

PRODUCT_LAUNCH | 2026-06-06 | Impact: MEDIUM

arXiv:2606.05806v1
Announce Type: new
Abstract: Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological

Artificial Intelligence
