Event: Auditing LLM Benchmarks with Item Response Theory

PRODUCT_LAUNCH · 2026-06-01 · Impact: MEDIUM

arXiv:2605.30504v1 · Announce Type: new

Abstract: LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical la
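The abstract does not spell out the indicator itself, so the following is only a rough sketch of the general idea, not the paper's method. It assumes a standard two-parameter logistic (2PL) IRT model, fitted by plain gradient ascent, and it assumes that an item whose fitted discrimination is negative (stronger models "fail" it more often than weaker ones) is a reasonable mislabel candidate. The function names `simulate_responses`, `fit_2pl`, and `flag_candidates`, the synthetic data, and every hyperparameter are hypothetical.

```python
import numpy as np


def simulate_responses(n_models=60, n_items=30, flipped_item=15, seed=0):
    """Toy 0/1 correctness matrix (models x items) with one gold label flipped."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(-2.0, 2.0, n_models)   # hypothetical model abilities
    b = np.linspace(-1.5, 1.5, n_items)        # hypothetical item difficulties
    p = 1.0 / (1.0 + np.exp(-1.5 * (theta[:, None] - b[None, :])))
    responses = (rng.random((n_models, n_items)) < p).astype(float)
    # A wrong gold label inverts what is scored as "correct" for that item.
    responses[:, flipped_item] = 1.0 - responses[:, flipped_item]
    return responses, flipped_item


def fit_2pl(responses, n_steps=1000, lr=0.1, l2=0.01):
    """Fit P(correct) = sigmoid(a_j * (theta_i - b_j)) by joint gradient ascent."""
    responses = np.asarray(responses, dtype=float)
    n_models, n_items = responses.shape
    # Initialise abilities from standardised mean scores; this fixes the sign.
    score = responses.mean(axis=1)
    theta = (score - score.mean()) / (score.std() + 1e-8)
    a = np.ones(n_items)    # item discriminations
    b = np.zeros(n_items)   # item difficulties
    for _ in range(n_steps):
        diff = theta[:, None] - b[None, :]
        p = 1.0 / (1.0 + np.exp(-a * diff))
        err = responses - p                     # d(log-likelihood)/d(logit)
        grad_theta = (err * a).sum(axis=1) - l2 * theta
        grad_a = (err * diff).sum(axis=0) - l2 * (a - 1.0)
        grad_b = (-err * a).sum(axis=0) - l2 * b
        theta = theta + lr * grad_theta / n_items
        a = a + lr * grad_a / n_models
        b = b + lr * grad_b / n_models
        # Re-standardise abilities so the latent scale stays identified.
        theta = (theta - theta.mean()) / (theta.std() + 1e-8)
    return theta, a, b


def flag_candidates(a, top_k=5):
    """Return the top_k item indices with the lowest fitted discrimination."""
    return [int(i) for i in np.argsort(a)[:top_k]]


responses, flipped = simulate_responses()
theta, a, b = fit_2pl(responses)
print("candidates:", flag_candidates(a, top_k=3), "planted flip:", flipped)
```

On real data, the rows would be the models' graded responses and the flagged items would go to a human for re-annotation; a low discrimination value alone does not prove a label is wrong, since ambiguous or trick items can look similar.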
