RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

ArXiv CS.CL, 2026-06-03. Authors: Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao


Abstract

arXiv:2606.03889v1 (announce type: new)

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks.
