Pitfalls of Evaluating Language Models with Open Benchmarks

ArXiv CS.CL, 2026-06-05
Authors: Md. Najib Hasan (Wichita State University), Md Mahadi Hassan Sibat (University of Central Florida), Mohammad Fakhruddin Babar (Washington State University), Souvika Sarkar (Wichita State University), Monowar Hasan (Washington State University), Santu Karmaker (University of Central Florida)

Abstract

arXiv:2507.00460v3

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet this openness also creates a substantial risk of data leakage during LM testing, whether deliberate or inadvertent. Such leakage undermines the fairness and reliability of leaderboard rankings and leaves them vulnerable to manipulation by unscrupulous actors. We illustrate the severity of this issue by intentionally constructing cheating models: smaller variants of BART, T5, and GPT-2, fine-tuned directly on publicly available test sets. As expected, these models excel on the target benchmarks but fail badly to generalize to comparable unseen test sets.
