A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks · Event

PRODUCT_LAUNCH · 2026-05-28 · Impact: MEDIUM

arXiv:2605.28556v1 · Announce Type: new

Abstract: As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns ag

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks · Related Technologies