EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

ArXiv CS.CL | 2026-06-24 | PAPER | en | Authors: Zeyao Du, Tong Li, Yanci Zhang, Haibo Zhang

Details

Source site
ArXiv CS.CL
Authors
Zeyao Du, Tong Li, Yanci Zhang, Haibo Zhang
Article type
PAPER
Language
en
Publication date
2026-06-24

Abstract

arXiv:2606.17698v2. As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source.
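
To make the grading idea concrete, the following is a minimal sketch of how a source-tagged rubric could attribute a failure to a requirement and the place that requirement was disclosed. It is not the benchmark's actual schema or grader: the names `Source`, `Requirement`, and `grade`, the `kind` field, and the example product data are all hypothetical. Only the three requirement sources and the 100-tool-call budget come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Where a requirement is disclosed (the three sources named in the abstract)."""
    QUERY = "query"                  # visible in the shopper's query
    PROFILE = "profile"              # available only through a profile tool
    CLARIFICATION = "clarification"  # revealed only when the agent asks


@dataclass(frozen=True)
class Requirement:
    """Hypothetical rubric item: one requirement, its type, and its source."""
    name: str
    kind: str        # assumed rubric type, e.g. "attribute" or "review"
    source: Source
    attribute: str   # product field the requirement is checked against
    expected: str


MAX_TOOL_CALLS = 100  # budget stated in the abstract


def grade(requirements, chosen_product, tool_calls_used):
    """Return (passed, failures); each failure is a (requirement name, source) pair."""
    if tool_calls_used > MAX_TOOL_CALLS:
        return False, [("tool_call_budget", None)]
    failures = [
        (r.name, r.source.value)
        for r in requirements
        if chosen_product.get(r.attribute) != r.expected
    ]
    return not failures, failures


# Invented example task: one requirement per source.
requirements = [
    Requirement("waterproof", "attribute", Source.QUERY, "waterproof", "yes"),
    Requirement("size_fits_profile", "attribute", Source.PROFILE, "size", "M"),
    Requirement("no_leather", "attribute", Source.CLARIFICATION, "material", "synthetic"),
]

product = {"waterproof": "yes", "size": "M", "material": "leather"}
passed, failures = grade(requirements, product, tool_calls_used=37)
print(passed, failures)  # False [('no_leather', 'clarification')]
```

In this toy case the agent picked a product that satisfies the query and the profile but misses a requirement it could only have learned by asking, so the failure is attributed to the clarification source rather than reported as a bare wrong answer.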