LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

ArXiv CS.CL, 2026-06-01. Authors: Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang

Abstract

arXiv:2605.30434v1, announce type: cross.

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops by nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures.
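To make the state-evolution patterns named in the abstract concrete, here is a minimal Python sketch of what maintaining, updating, restoring, and composing an analytical state could look like across turns. It is purely illustrative and is not the LongDS implementation: the `AnalysisState` class, the `compose` helper, the turn numbers, and the parameter names are all hypothetical.

```python
from copy import deepcopy


class AnalysisState:
    """Hypothetical tracker for an evolving analytical state (not from LongDS)."""

    def __init__(self, params):
        self.params = dict(params)
        self.snapshots = {}  # turn number -> copy of params before that turn

    def update(self, turn, **changes):
        # Save the state as it stood before this turn, then apply the change.
        self.snapshots[turn] = deepcopy(self.params)
        self.params.update(changes)

    def rollback(self, turn):
        # Restore the state as it stood just before `turn` was applied.
        self.params = deepcopy(self.snapshots[turn])

    def counterfactual(self, **changes):
        # Return a perturbed copy; the live state is left untouched.
        return {**self.params, **changes}


def compose(*states):
    # Merge several states; later states override earlier ones on conflict.
    merged = {}
    for state in states:
        merged.update(state)
    return merged


# Assumed example turns: an update, a what-if question, then a rollback.
state = AnalysisState({"filter": "year >= 2020", "metric": "mean"})
state.update(turn=3, metric="median")
what_if = state.counterfactual(filter="year >= 2015")
state.update(turn=9, filter="region == 'EU'")
state.rollback(turn=9)
combined = compose(state.params, {"groupby": "region"})
```

In this sketch, a question at a late turn can only be answered correctly if the earlier update at turn 3 is still remembered and the rolled-back change at turn 9 is discarded, which is the kind of long-range dependency the benchmark is described as testing.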
