OckBench: Measuring the Efficiency of LLM Reasoning (Article)

arXiv cs.CL | 2026-06-04 | NEWS | en | Authors: Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

Abstract

arXiv:2511.05722v3 (announce type: replace). Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality while neglecting a critical dimension: the efficiency of token usage. Token efficiency is highly variable in practice. Models solving the same problem with similar accuracy can exhibit up to a 5.0× difference in token length, leading to a massive gap in model reasoning ability. Such variance exposes significant redundancy and highlights the need for a standardized benchmark that quantifies the gap in token efficiency. We therefore introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency.

Related events (1)

OckBench: Measuring the Efficiency of LLM Reasoning
2026-06-04 | PRODUCT_LAUNCH | Impact: MEDIUM
