SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge? (Event)

PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM

arXiv:2605.30104v1 (Announce Type: new)

Abstract: Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. Therefore, we present Seeded Elimination with Adaptive LLM-