Abstract
arXiv:2605.23950v1
This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness is often a stronger determinant of agent performance than the model it wraps. The harness is the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines.
Related events
Stop Comparing LLM Agents Without Disclosing the Harness
2026-05-26 PRODUCT_LAUNCH Impact: MEDIUM