EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

ArXiv CS.AI | 2026-05-26 | NEWS | en | Authors: Haiyang Shen, Xuanzhong Chen, Wendong Xu, Yun Ma, Liang Chen, Kuan Li

Details

Source site: ArXiv CS.AI
Authors: Haiyang Shen, Xuanzhong Chen, Wendong Xu, Yun Ma, Liang Chen, Kuan Li
Article type: NEWS
Language: en
Publication date: 2026-05-26

Abstract

arXiv:2605.24110v1 (announce type: new)

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4).
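
The abstract names the two metrics but does not give their formulas, so the following is only a rough sketch of how a fail-stop multi-round score could differ from a single-round score. Everything in it is an assumption made for illustration: the function names `mt_at_k` and `single_round_score`, the `attempt_passes` callback, and the choice to score a task as the percentage of rounds completed before the first round that fails all attempts. The benchmark's actual definitions may differ.

```python
from typing import Callable, Sequence


def mt_at_k(
    num_rounds: int,
    attempt_passes: Callable[[int, int], bool],
    k: int = 4,
) -> float:
    """Hypothetical fail-stop multi-round score for one task.

    Rounds run in order on the same workspace. Each round gets up to k
    attempts; the task stops at the first round that fails all of them.
    Returns the percentage of rounds completed (an assumed aggregation).
    """
    completed = 0
    for round_idx in range(num_rounds):
        # any() short-circuits, so no further attempts run after a pass.
        if any(attempt_passes(round_idx, attempt) for attempt in range(k)):
            completed += 1
        else:
            break  # fail-stop: later rounds are never attempted
    return 100.0 * completed / num_rounds


def single_round_score(round_passes: Sequence[bool]) -> float:
    """Hypothetical single-round score for one task.

    Each round is judged independently, starting from a reference-completed
    prior state, so one failed round does not block the others.
    """
    return 100.0 * sum(round_passes) / len(round_passes)
```

Under these assumptions, an agent that fails only the third of five rounds would be credited for four rounds by the single-round score but only two by the fail-stop score, which is the kind of gap the abstract reports between SR and MT@4.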
