Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

ACQUISITION · 2026-06-02 · Impact: HIGH

arXiv:2606.00103v1 (announce type: new)

Abstract: We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. In this setting, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard su
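The interaction loop described in the abstract (rules only, targeted queries to a hidden environment, belief updates from partial observations, and a decision about when to submit) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `HiddenNumberGame` is a toy environment, `IntervalAgent` is a simple rule-based stand-in for an LLM, and `run_episode` is an assumed driver loop.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Toy hidden environment (hypothetical): a secret integer the agent cannot see."""
    secret: int
    queries: int = 0  # number of queries issued so far

    def query(self, guess: int) -> str:
        """Return a partial observation: how the secret compares to the guess."""
        self.queries += 1
        if self.secret < guess:
            return "lower"
        if self.secret > guess:
            return "higher"
        return "equal"


@dataclass
class IntervalAgent:
    """Stand-in for the model: its belief is the interval of remaining candidates."""
    low: int
    high: int

    def next_query(self) -> int:
        # Targeted query: the midpoint halves the candidate set.
        return (self.low + self.high) // 2

    def update(self, guess: int, observation: str) -> None:
        # Belief update: shrink the interval according to the observation.
        if observation == "lower":
            self.high = guess - 1
        elif observation == "higher":
            self.low = guess + 1
        else:
            self.low = self.high = guess

    def ready_to_submit(self) -> bool:
        # Submit once only one candidate remains.
        return self.low == self.high


def run_episode(env: HiddenNumberGame, agent: IntervalAgent, max_turns: int = 20):
    """Query until the belief collapses to one candidate or the turn budget ends."""
    for _ in range(max_turns):
        if agent.ready_to_submit():
            break
        guess = agent.next_query()
        agent.update(guess, env.query(guess))
    answer = agent.low  # final submitted answer
    return answer, env.queries, answer == env.secret


answer, n_queries, correct = run_episode(HiddenNumberGame(secret=37), IntervalAgent(1, 100))
```

In this toy setting the returned tuple records the submitted answer, the number of queries used, and whether the answer was correct. These are illustrative quantities only and are not claimed to match the benchmark's metrics.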
