Resolution Diagnostics for Paired LLM Evaluation
Event type: PRODUCT_LAUNCH | Date: 2026-05-29 | Impact: MEDIUM
arXiv:2605.30315v1 (announce type: new)
Abstract: Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-
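To make the (alpha, 1-beta) = (0.05, 0.8) target concrete, here is a minimal sketch of one common way to check whether a paired comparison is "resolved". It is not taken from the paper: it assumes a two-sided normal approximation on per-item score differences, treats items as independent (so it ignores the subject-level clustering the abstract mentions), and the function names `min_detectable_diff` and `is_resolved` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def min_detectable_diff(diffs, alpha=0.05, power=0.8):
    """Smallest mean paired difference detectable at the given alpha and power.

    Assumption: two-sided z-test on independent per-item differences.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired items")
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


def is_resolved(scores_a, scores_b, alpha=0.05, power=0.8):
    """True if the observed mean gap reaches the minimum detectable difference."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired design requires the same items for both models")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return abs(mean(diffs)) >= min_detectable_diff(diffs, alpha, power)


if __name__ == "__main__":
    # Illustrative data only: per-item correctness (1/0) on 100 shared items.
    model_a = [1] * 60 + [0] * 40
    model_b = [1] * 50 + [0] * 50
    print(is_resolved(model_a, model_b))
```

Because the standard error is computed on the differences rather than on each model's scores separately, two models that agree on most items can be separated by a much smaller gap than an unpaired comparison would require. With clustered items, the standard error would need a cluster-aware estimate instead.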
Source: arXiv cs.CL, 2026-05-29