StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

ArXiv CS.AI | 2026-06-19 | NEWS | en | Authors: Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

Details

Source site
ArXiv CS.AI
Authors
Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto
Article type
NEWS
Language
en
Publication date
2026-06-19

Abstract

arXiv:2606.19613v1 (Announce Type: cross)

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this measure matches real vibe-coding, where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and modify it across a tunable number of procedurally generated follow-up change requests (100 in our experiments), resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability; change sequences are drawn from either a hardcoded or LLM-driven sampler, both constrained to a structured action space to ensure changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark through HTTP, making testing fully black-box and language-agnostic.
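To make the contrast between stamina and fraction-of-tasks-solved concrete, here is a minimal Python sketch. It is not the benchmark's actual code: the function names, the two-action space (`add_endpoint`, `remove_endpoint`), and the endpoint naming scheme are all assumptions chosen for illustration, and the abstract does not specify the exact scoring rule beyond "consecutive turns before failing".

```python
import random
from typing import Iterable


def stamina(turn_results: Iterable[bool]) -> int:
    """Count consecutive passing turns before the first failure (assumed scoring rule)."""
    count = 0
    for passed in turn_results:
        if not passed:
            break
        count += 1
    return count


def fraction_solved(turn_results: list[bool]) -> float:
    """Conventional metric: share of turns passed, regardless of order."""
    return sum(turn_results) / len(turn_results) if turn_results else 0.0


def sample_changes(num_turns: int, seed: int = 0) -> list[tuple[str, str]]:
    """Hypothetical hardcoded sampler over a tiny structured action space.

    Validity constraint: an endpoint can only be removed if it currently exists.
    """
    rng = random.Random(seed)  # seeded, so the change sequence is reproducible
    endpoints: list[str] = []
    changes: list[tuple[str, str]] = []
    for turn in range(num_turns):
        # Removal is only offered as an action when there is something to remove.
        actions = ["add_endpoint"] + (["remove_endpoint"] if endpoints else [])
        action = rng.choice(actions)
        if action == "add_endpoint":
            name = f"/resource{turn}"
            endpoints.append(name)
        else:
            name = endpoints.pop(rng.randrange(len(endpoints)))
        changes.append((action, name))
    return changes
```

For a session with per-turn results `[True, True, False, True]`, `fraction_solved` gives 0.75 while `stamina` gives 2: the failure at the third turn ends the streak even though a later turn passes.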
