The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
Event: PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM
arXiv:2605.28700v1 (announce type: cross)
Abstract: The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using