100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

ArXiv CS.CL | 2026-06-04 | NEWS | en | Authors: Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

Details

Source: ArXiv CS.CL
Authors: Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Article type: NEWS
Language: en
Published: 2026-06-04

Original text

Abstract

arXiv:2505.19293v2 (announce type: replace)

Long-context capability is considered one of the most important abilities of LLMs, because a truly long-context-capable LLM lets users hand off tasks that would otherwise be exhausting -- e.g., asking the LLM directly about a long-form document instead of reading through it to find the answers. However, existing long-context evaluation benchmarks built on real tasks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, which makes cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal the length at which a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
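As a rough illustration of what separating baseline ability from long-context ability could look like, the sketch below scores each longer input length relative to the model's own short-context average. The abstract does not define the metric, so the function name `relative_long_context_score`, the choice of baseline lengths, and the relative-change formula are all assumptions made for illustration, not the authors' definition.

```python
from statistics import mean


def relative_long_context_score(scores_by_length, base_lengths):
    """Hypothetical metric sketch (not the paper's definition).

    scores_by_length: dict mapping input length in tokens -> task score.
    base_lengths: lengths treated as the short-context baseline (an assumption).
    Returns (baseline, per-length relative change, mean relative change).
    """
    base_set = set(base_lengths)

    # Baseline ability: average score on the short inputs.
    base = mean(scores_by_length[n] for n in base_lengths)
    if base == 0:
        raise ValueError("baseline score is zero; relative change is undefined")

    long_lengths = sorted(n for n in scores_by_length if n not in base_set)
    if not long_lengths:
        raise ValueError("no lengths beyond the baseline were provided")

    # Relative change at each longer length: negative values mean the model
    # does worse than its own short-context baseline.
    per_length = {n: (scores_by_length[n] - base) / base for n in long_lengths}

    # Single summary number: average relative change over the longer lengths.
    overall = mean(list(per_length.values()))
    return base, per_length, overall


# Made-up scores on a 0-100 scale, evaluated at four controlled lengths.
example_scores = {2000: 80, 4000: 80, 8000: 60, 16000: 40}
base, per_length, overall = relative_long_context_score(example_scores, [2000, 4000])
print(base, per_length, overall)
```

Because each model is normalized by its own baseline, a model with strong short-context scores is not automatically ranked higher, and evaluating at several controlled lengths shows where the relative change starts to fall away from zero.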
