FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks (Event)

Name: FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
Start: 2026-05-29

PRODUCT_LAUNCH · 2026-05-29 · Impact: MEDIUM

arXiv:2605.29001v1
Announce Type: cross
Abstract: A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%). Removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 mode
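The cross-model unanimity idea in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it states the exact criterion, so everything here is an assumption: the function name `flag_suspect_paraphrases`, the notion that each model casts one boolean "this paraphrase looks wrong" vote, and the toy votes are all hypothetical. Only the 3-of-4 threshold and the 4-in-129 count come from the text.

```python
from typing import Dict, List


def flag_suspect_paraphrases(
    votes: Dict[str, List[bool]],
    min_agree: int = 3,
) -> List[str]:
    """Return ids of paraphrases that at least `min_agree` models flagged.

    `votes` maps a paraphrase id to one boolean per model
    (True = that model's behavior suggests the paraphrase is faulty).
    How a vote is derived is NOT specified in the source; it is assumed here.
    """
    return sorted(pid for pid, v in votes.items() if sum(v) >= min_agree)


# Toy votes, invented purely for illustration (4 hypothetical models).
votes = {
    "p1": [True, True, True, False],    # 3 of 4 agree: flagged
    "p2": [False, True, False, False],  # 1 of 4: not flagged
    "p3": [True, True, True, True],     # 4 of 4: flagged
}

suspects = flag_suspect_paraphrases(votes)

# Arithmetic check of the figure quoted in the abstract: 4 of 129 groups.
audit_rate = round(100 * 4 / 129, 1)
```

Raising `min_agree` to 4 would demand full unanimity and flag fewer items; the trade-off between the two thresholds is not discussed in the visible part of the abstract.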

Artificial Intelligence

Relationship Graph

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks · Related People

Greg

Sam