RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning (Event)
BREAKTHROUGH | 2026-06-10 | Impact: HIGH
arXiv:2606.10254v1. Announce Type: cross.
Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial
ArXiv CS.CL, 2026-06-10