Details
- Source site
- ArXiv CS.AI
- Authors
- Vlad Sobal, Shuo Yang, Yuting Zhang, Wei Xia, Stefano Soatto
- Article type
- NEWS
- Language
- en
- Publication date
- 2026-06-19
Abstract
arXiv:2606.19613v1, announce type: cross. We introduce StaminaBench, a benchmark that measures the stamina of coding agents: how many consecutive interaction turns (change requests) they can handle before failing. Unlike the prevailing fraction-of-tasks-solved metric, this matches real vibe-coding, where sessions run dozens or hundreds of turns. In StaminaBench, agents implement a REST API server and then modify it across a tunable number of procedurally generated follow-up change requests (100 in our experiments, resulting in codebases of up to 6,000 lines). Tests are generated fully programmatically without LLM involvement, ensuring reproducibility and reliability. Change sequences are drawn from either a hardcoded or an LLM-driven sampler, both constrained to a structured action space to ensure that changes are valid. The agent and the server run in an isolated environment and communicate with the benchmark over HTTP, making testing fully black-box and language-agnostic.
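To make the contrast between the two metrics concrete, here is a minimal sketch in Python. It assumes each turn yields a single pass/fail outcome and that stamina is the number of consecutive passing turns before the first failure. The function names `stamina` and `fraction_solved` and the example session are hypothetical; the abstract does not specify the benchmark's actual scoring code, so this illustrates the idea and is not the authors' implementation.

```python
from typing import Sequence


def stamina(turn_passed: Sequence[bool]) -> int:
    """Count consecutive passing turns before the first failure (assumed definition)."""
    count = 0
    for passed in turn_passed:
        if not passed:
            break
        count += 1
    return count


def fraction_solved(turn_passed: Sequence[bool]) -> float:
    """Share of turns that passed, regardless of where failures occur."""
    if not turn_passed:
        return 0.0
    return sum(turn_passed) / len(turn_passed)


# Hypothetical session: three passing turns, one failure, then two more passes.
session = [True, True, True, False, True, True]
print(stamina(session))          # 3
print(fraction_solved(session))  # 5/6, roughly 0.83
```

In this made-up session the fraction-based score stays high because later turns pass, while the stamina-style count stops at the first failure, which is the distinction the abstract draws.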