Searching the Internet for Challenging Benchmarks at Scale 文章

ArXiv CS.CL2026-05-27NEWSen作者: Wenda Xu, Vil\'em Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

摘要

arXiv:2509.26619v3 Announce Type: replace Abstract: Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation.