FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
Event: PRODUCT_LAUNCH, 2026-05-29. Impact: MEDIUM
arXiv:2605.29001v1. Announce Type: cross.
Abstract: A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%). Removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it. These ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 mode