TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

Details

Source: ArXiv CS.AI
Authors: QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets
Article type: NEWS
Language: en
Published: 2026-06-09

Abstract

arXiv:2606.09450v1 (Announce Type: new)

LLMs have recently achieved strong results on formal theorem-proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks, consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation not only of whether the final theorem is proved from scratch, but also of partial progress through the theorem's internal proof structure.
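The abstract does not say how a premised instance is stored or how partial progress is scored. As a rough illustration only, the sketch below assumes a premised instance is a main theorem plus a list of supporting subtheorems, and reports two numbers: whether the main theorem was proved, and the fraction of tasks in the family that the prover closed. The class names (`ProvingTask`, `PremisedInstance`), the function `score_instance`, its fields, and the scoring rule are all hypothetical, not TheoremBench's actual format or metric. The Lean statements are illustrative Mathlib-style examples, not benchmark entries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProvingTask:
    """One proving target (hypothetical fields): a name and a Lean4 statement."""
    name: str
    statement: str


@dataclass
class PremisedInstance:
    """Hypothetical layout of a premised instance: the main theorem
    plus automatically extracted supporting subtheorems."""
    main: ProvingTask
    subtheorems: list[ProvingTask] = field(default_factory=list)

    def all_tasks(self) -> list[ProvingTask]:
        # The family always contains the main theorem, so it is never empty.
        return [self.main, *self.subtheorems]


def score_instance(instance: PremisedInstance, proved: set[str]) -> dict[str, float]:
    """Assumed scoring rule (not from the paper): main-theorem success as
    0.0/1.0, plus the fraction of family tasks whose names are in `proved`."""
    tasks = instance.all_tasks()
    solved = sum(1 for task in tasks if task.name in proved)
    return {
        "main_proved": float(instance.main.name in proved),
        "partial_progress": solved / len(tasks),
    }


# Usage: an illustrative family for the infinitude of primes.
inst = PremisedInstance(
    main=ProvingTask("main_infinite_primes", "∀ n : ℕ, ∃ p, n ≤ p ∧ Nat.Prime p"),
    subtheorems=[
        ProvingTask("lemma_factorial_pos", "∀ n : ℕ, 0 < Nat.factorial n"),
        ProvingTask("lemma_prime_dvd", "∀ n : ℕ, n ≠ 1 → ∃ p, Nat.Prime p ∧ p ∣ n"),
    ],
)
print(score_instance(inst, {"lemma_factorial_pos"}))
# prints {'main_proved': 0.0, 'partial_progress': 0.3333333333333333}
```

Reporting the two numbers separately keeps the from-scratch result of the plain main version distinct from credit earned on subtheorems, which mirrors the two-version split described in the abstract.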