100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? 文章

ArXiv CS.CL2026-06-04NEWSen作者: Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

详细信息

来源站点
ArXiv CS.CL
作者
Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
文章类型
NEWS
语言
en
发布日期
2026-06-04

摘要

arXiv:2505.19293v2 Announce Type: replace Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.

相关事件

暂无数据

相关公司

暂无数据

相关人物

暂无数据

相关技术

暂无数据