Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

ArXiv CS.CL | 2026-06-03 | Authors: Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao

Abstract

arXiv:2606.03318v1. Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated, idealized user assumptions and lack experience-oriented evaluation. As a result, they fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool-calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations of 19 widely adopted open-source and proprietary LLMs using our benchmark.
