The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

ArXiv CS.CL, 2026-05-29. Authors: Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

Abstract

arXiv:2605.28700v2. The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, and concluded that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the integers appearing in the problem texts of the main GSM-Symbolic dataset are systematically shifted towards larger values relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large-number effect accounts for significance in roughly half the remaining cases.
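The distribution comparison mentioned in the abstract can be illustrated with a minimal sketch of the two-sample Kolmogorov-Smirnov (K-S) statistic. This is not the authors' code: the function names `extract_integers` and `ks_statistic`, the regular expression used to find integers, and the toy problem texts are all assumptions made for illustration, and the sketch computes only the statistic, not a p-value.

```python
import re
from bisect import bisect_right


def extract_integers(text):
    """Return the integers found in a problem text.

    Assumption: integers are digit runs, optionally with thousands
    separators (e.g. "1,200"). Decimals are not handled specially.
    """
    tokens = re.findall(r"\d+(?:,\d{3})*", text)
    return [int(tok.replace(",", "")) for tok in tokens]


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest absolute gap between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy texts invented for this example; they are not from GSM8K or GSM-Symbolic.
base_texts = ["Tom has 3 apples and buys 12 more."]
variant_texts = ["Tom has 37 apples and buys 120 more."]

base_numbers = [n for t in base_texts for n in extract_integers(t)]
variant_numbers = [n for t in variant_texts for n in extract_integers(t)]
print(ks_statistic(base_numbers, variant_numbers))
```

A statistic near 0 means the two empirical distributions almost coincide, while a value near 1 means they barely overlap. Because integer data contain many ties, a p-value derived from the continuous K-S distribution would be conservative, which is one reason the sketch stops at the statistic.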