AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications 文章

ArXiv CS.AI2026-05-26NEWSen作者: Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

摘要

arXiv:2602.22769v3 Announce Type: replace Abstract: Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between applications and evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric settings. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any Length), a benchmark designed to evaluate long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories of arbitrary horizons paired with rule-based QA.