ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning 文章

ArXiv CS.CL2026-06-02NEWSen作者: Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora

查看原文 →

关系图谱

摘要

arXiv:2512.07795v2 Announce Type: replace-cross Abstract: Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as distributions rather than point estimates. We find that this variance is structured rather than random: a two-component taxonomy -- Global Noise, capturing cross-benchmark unevenness, and Run Noise, capturing within-benchmark stochasticity -- reveals that strategy architecture predicts stability profiles, while models and strategies shift…

摘要可能不完整，可查看原文

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (2)

相关技术查看全部 (2)