From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

arXiv cs.AI | 2026-06-03 | Authors: Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan

Abstract

arXiv:2606.03660v1

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks.
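
To make the gap between answer-level and process-level scoring concrete, the following is a minimal sketch of a rule-based check on a structured reasoning trace. It is not the ChemCoTBench-V2 verifier, whose design the abstract does not detail. The `Step` record, the `atom_counts` parser (flat formulas only, no parentheses or charges), and the `score` function are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    """One intermediate claim in a trace (hypothetical schema)."""
    element: str  # element whose atom count the model asserts
    value: int    # the count the model claims


def atom_counts(formula):
    """Count atoms in a flat formula such as 'C2H6O' (toy parser, an assumption)."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def score(formula, steps, final_answer, expected_answer):
    """Score the final answer and each intermediate claim separately."""
    truth = atom_counts(formula)
    # Each step is checked by a deterministic rule, not by an LLM judge.
    checks = [truth.get(step.element, 0) == step.value for step in steps]
    return {
        "answer_correct": final_answer == expected_answer,
        "process_accuracy": sum(checks) / len(checks) if checks else 0.0,
        "step_checks": checks,
    }


# A trace that reaches the right answer through a wrong intermediate state:
# the model claims ethanol (C2H6O) has 5 hydrogens.
trace = [Step("C", 2), Step("H", 5), Step("O", 1)]
result = score("C2H6O", trace, final_answer="C2H6O", expected_answer="C2H6O")
print(result)
```

In this toy case the answer-only score is perfect while the step-level check flags the second claim, which is the failure mode the abstract says final-answer benchmarks mask.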
