CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

ArXiv CS.CL | 2026-06-02 | Authors: Danqing Wang, Akshay Sivaraman, Lei Li

Abstract

arXiv:2606.01815v1 (announce type: new)

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions.
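To make the idea of constraints over interdependent entities with distractors more concrete, here is a minimal toy sketch in Python. It is not the CRAB-Bench generator: the entity types (flights, hotels), field names, budget, and the `valid_solutions` helper are all hypothetical and chosen only for illustration. It shows how a handful of cross-entity constraints can leave a single valid combination among several plausible-looking candidates.

```python
from itertools import product

# Hypothetical toy entities; none of these names or values come from the paper.
flights = [
    {"id": "F1", "arrive_day": 1, "price": 300},
    {"id": "F2", "arrive_day": 2, "price": 180},  # distractor: cheap but arrives late
    {"id": "F3", "arrive_day": 1, "price": 520},  # distractor: on time but too expensive
]
hotels = [
    {"id": "H1", "checkin_day": 1, "price": 200},
    {"id": "H2", "checkin_day": 2, "price": 90},   # distractor: only pairs with F2
    {"id": "H3", "checkin_day": 1, "price": 450},  # distractor: breaks the budget
]

BUDGET = 600  # assumed shared budget across both entities

# Each constraint couples one or more entities, like an edge in a constraint graph.
constraints = [
    lambda f, h: f["arrive_day"] == h["checkin_day"],  # cross-entity dependency
    lambda f, h: f["price"] + h["price"] <= BUDGET,    # shared budget
    lambda f, h: f["arrive_day"] == 1,                 # user requirement
]

def valid_solutions(flights, hotels, constraints):
    """Return the (flight id, hotel id) pairs that satisfy every constraint."""
    return [
        (f["id"], h["id"])
        for f, h in product(flights, hotels)
        if all(check(f, h) for check in constraints)
    ]

solutions = valid_solutions(flights, hotels, constraints)
total = len(flights) * len(hotels)
print(f"{len(solutions)} valid of {total} candidate pairs: {solutions}")
```

In this toy setup only one of the nine candidate pairs survives all three constraints. The benchmark described above operates at a much larger scale, with thousands of misleading candidates.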
