StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

ArXiv CS.AI | 2026-06-19 | NEWS | en | Authors: Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto

Details

Source site: ArXiv CS.AI
Authors: Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto
Article type: NEWS
Language: en
Publication date: 2026-06-19

Abstract

arXiv:2606.19613v1. Announce type: cross.

We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe coding, where sessions run for dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and then modify it across a tunable number of procedurally generated follow-up change requests (100 in our experiments), resulting in codebases of up to 6,000 lines. Tests are generated fully programmatically, without LLM involvement, ensuring reproducibility and reliability. Change sequences are drawn from either a hardcoded or an LLM-driven sampler, both constrained to a structured action space so that every change is valid. The agent and the server run in an isolated environment and communicate with the benchmark over HTTP, making testing fully black-box and language-agnostic.
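To make the contrast between the two metrics concrete, here is a minimal Python sketch. It assumes that stamina is simply the count of consecutive turns passed before the first failure, which is one reading of the abstract and not necessarily the paper's exact definition. The function names and the sample session are hypothetical.

```python
from typing import Sequence


def stamina(turn_passed: Sequence[bool]) -> int:
    """Count consecutive passing turns before the first failure.

    Assumption: a session ends at its first failed turn, so any
    later passes do not count toward stamina.
    """
    count = 0
    for passed in turn_passed:
        if not passed:
            break
        count += 1
    return count


def fraction_solved(turn_passed: Sequence[bool]) -> float:
    """Conventional metric: share of turns passed, regardless of order."""
    if not turn_passed:
        return 0.0
    return sum(turn_passed) / len(turn_passed)


# Hypothetical 10-turn session: the agent fails only on the fourth turn.
session = [True, True, True, False, True, True, True, True, True, True]

print(stamina(session))          # 3
print(fraction_solved(session))  # 0.9
```

Under this reading, an agent that passes 9 of 10 turns still scores a stamina of only 3 if its single failure comes on the fourth turn. That is the gap between order-insensitive and order-sensitive scoring that the abstract points to.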
