Uncovering Competency Gaps in Large Language Models and Their Benchmarks (Event)

PRODUCT_LAUNCH 2026-06-02 Impact: MEDIUM

arXiv:2512.20638v2 Announce Type: replace

Abstract: The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method usi