LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

ArXiv CS.CL, 2026-05-26, NEWS, en

Authors: Ming Yin, Dinghan Shen, Silei Xu, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Jianbing Han, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

Abstract

arXiv:2508.15760v2 (announce type: replace)

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool-calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in benchmarking multi-step tasks that use diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 real-world queries that require coordinated use of multiple MCP tools. To address temporal variability in real-world tool responses, we introduce a parallel evaluation framework in which a reference agent executes a validated plan simultaneously to produce real-time reference outputs. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting challenges in multi-step tool use.
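The parallel evaluation idea can be illustrated with a toy sketch. This is not the benchmark's actual harness: `LiveTool`, `reference_agent`, `agent_under_test`, and `evaluate` are hypothetical names, and the tool is a stand-in whose response changes whenever `tick` is advanced. The sketch only shows why a reference produced at the same time as the candidate run stays comparable, while a reference recorded earlier may not.

```python
import asyncio


class LiveTool:
    """Hypothetical stand-in for an MCP tool whose response drifts over time."""

    def __init__(self) -> None:
        self.tick = 0  # advancing this simulates the outside world changing

    async def call(self, query: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"{query}:v{self.tick}"


async def reference_agent(tool: LiveTool, plan: list[str]) -> list[str]:
    # Executes a fixed, pre-validated plan step by step.
    return [await tool.call(step) for step in plan]


async def agent_under_test(tool: LiveTool, plan: list[str]) -> list[str]:
    # Placeholder: a real agent would pick tools and arguments with an LLM.
    return [await tool.call(step) for step in plan]


async def evaluate(tool: LiveTool, plan: list[str]) -> bool:
    # Run both agents concurrently so they observe the same live state.
    reference, candidate = await asyncio.gather(
        reference_agent(tool, plan),
        agent_under_test(tool, plan),
    )
    return reference == candidate


def demo() -> tuple[bool, bool]:
    tool = LiveTool()
    plan = ["search", "fetch"]

    # A reference recorded ahead of time goes stale once the world changes.
    stale_reference = asyncio.run(reference_agent(tool, plan))
    tool.tick += 1
    candidate = asyncio.run(agent_under_test(tool, plan))
    stale_match = stale_reference == candidate

    # A reference produced in parallel with the candidate stays comparable.
    live_match = asyncio.run(evaluate(tool, plan))
    return stale_match, live_match


if __name__ == "__main__":
    print(demo())
```

In this toy setup the pre-recorded reference no longer matches the candidate output, while the concurrently produced one does. Exact equality is used only for simplicity; how the benchmark actually scores outputs against the reference is not described in the abstract.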
