GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations (Event)

PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM

arXiv:2605.07053v2 Announce Type: replace

Abstract: Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memor