Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

ArXiv CS.CL | 2026-06-03 | NEWS | en | Authors: Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao

Abstract

arXiv:2606.03318v1 (announce type: new)

Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated, idealized user assumptions and lack experience-oriented evaluation. As a result, they fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations of 19 widely adopted open-source and proprietary LLMs using our benchmark.