ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning (Event)

PRODUCT_LAUNCH | 2026-06-02 | Impact: MEDIUM

arXiv:2512.07795v2 Announce Type: replace-cross

Abstract: Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meani
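
To make the head-to-head figure in the abstract concrete, here is a minimal Python sketch of one way such a win rate could be computed from repeated runs. The abstract does not say how the paper pairs runs or handles ties, so this is an assumption: every run of strategy A is compared with every run of strategy B, and a tie counts as a non-win. The function name `head_to_head_win_rate` and all scores are hypothetical illustrations, not data from the paper.

```python
from itertools import product
from statistics import mean


def head_to_head_win_rate(scores_a, scores_b):
    """Fraction of (A run, B run) pairs in which A strictly beats B.

    Assumption: all cross pairs are compared and ties count as non-wins.
    """
    if not scores_a or not scores_b:
        raise ValueError("both strategies need at least one run")
    pairs = list(product(scores_a, scores_b))
    wins = sum(1 for a, b in pairs if a > b)
    return wins / len(pairs)


# Hypothetical accuracy scores from four repeated runs of each strategy.
runs_a = [0.82, 0.79, 0.85, 0.80]
runs_b = [0.81, 0.78, 0.80, 0.83]

mean_a = mean(runs_a)  # single-number summary for strategy A
mean_b = mean(runs_b)  # single-number summary for strategy B
rate = head_to_head_win_rate(runs_a, runs_b)

print(f"mean A = {mean_a:.3f}, mean B = {mean_b:.3f}, A win rate = {rate:.4f}")
```

In this made-up data, strategy A has the higher mean score but wins only 9 of the 16 pairings, which is the kind of gap between a single reported number and run-to-run behavior that the abstract describes.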