Uncovering Competency Gaps in Large Language Models and Their Benchmarks 文章

ArXiv CS.CL2026-06-02NEWSen作者: Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

查看原文 →

关系图谱

摘要

arXiv:2512.20638v2 Announce Type: replace Abstract: The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g.

Uncovering Competency Gaps in Large Language Models and Their Benchmarks 文章

摘要

相关事件查看全部 (2)

相关公司

相关人物

相关产品

相关技术查看全部 (1)