PBT-Bench: Benchmarking AI Agents on Property-Based Testing

ArXiv CS.AI | 2026-06-02 | Authors: Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du

Abstract

arXiv:2605.15229v3

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning…

The abstract may be incomplete; see the original paper.
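
The idea of concentrating generator mass in the trigger region can be illustrated with a small sketch. Everything here is hypothetical and not taken from PBT-Bench: `normalize_angle` is an invented function with an injected floating-point bug, the documented invariant is that its result lies in [0, 360), and the standard-library `random` module stands in for a Hypothesis strategy so the sketch has no dependencies.

```python
import math
import random


def normalize_angle(deg: float) -> float:
    """Hypothetical library function, documented to return a value in [0, 360)."""
    r = math.fmod(deg, 360.0)  # fmod keeps the sign of deg
    if r < 0:
        # Injected bug: for a tiny negative r, r + 360.0 rounds to exactly 360.0,
        # which is outside the documented half-open range.
        r += 360.0
    return r


def in_documented_range(deg: float) -> bool:
    """The invariant derived from the (hypothetical) documentation."""
    return 0.0 <= normalize_angle(deg) < 360.0


def search(gen, tries: int = 1000, seed: int = 0):
    """Random search: return the first generated input violating the invariant."""
    rng = random.Random(seed)
    for _ in range(tries):
        x = gen(rng)
        if not in_documented_range(x):
            return x
    return None


def wide(rng: random.Random) -> float:
    """Broad generator: spreads its mass over a large range of angles."""
    return rng.uniform(-1e6, 1e6)


def targeted(rng: random.Random) -> float:
    """Narrow generator: only negative values of magnitude at most 1e-16."""
    return -(10.0 ** rng.uniform(-30.0, -16.0))


wide_hit = search(wide)
targeted_hit = search(targeted)
print("wide generator counterexample:    ", wide_hit)
print("targeted generator counterexample:", targeted_hit)
```

With Hypothesis, the narrowed generator would instead be expressed as the strategy passed to `@given`, for example a bounded `st.floats(...)`; the sketch only shows why the choice of input distribution decides whether the violation is ever observed.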
