It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

ArXiv CS.CL | 2026-05-27 | News | en | Author: Yong-eun Cho

Abstract

arXiv:2605.26731v1. Announce Type: cross.

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest…

The abstract may be incomplete; see the original article for the full text.
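
The 432-run figure is consistent with a full factorial design of 6 models × 3 harness conditions × 24 tasks, one run per cell. The abstract does not state the repetition count, so the one-run-per-cell reading is an assumption. The sketch below enumerates such a grid. The model and task identifiers are hypothetical placeholders (the abstract names only two of the six models), and `build_run_grid` is an illustrative helper, not something taken from the paper.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract names only two of the six models.
MODELS = [f"model_{i}" for i in range(1, 7)]
# Harness conditions as listed in the abstract.
HARNESSES = ["light", "balanced", "strict"]
# HEAT-24 has 24 tasks; the task identifiers here are made up for illustration.
TASKS = [f"task_{i:02d}" for i in range(1, 25)]


def build_run_grid(models, harnesses, tasks):
    """Return every (model, harness, task) cell of a full factorial design."""
    return list(product(models, harnesses, tasks))


# Assumes exactly one run per cell, which the abstract does not confirm.
runs = build_run_grid(MODELS, HARNESSES, TASKS)
print(len(runs))  # 6 * 3 * 24 = 432
```

Under this reading each model contributes 72 runs (3 harnesses × 24 tasks) and each harness condition contributes 144 runs (6 models × 24 tasks).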
