Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning | Event

PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM

arXiv:2605.28365v1 | Announce Type: cross

Abstract: Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is, the proof-winning answer is correct 96% of the time at high proved c