CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

arXiv cs.CL | 2026-06-03 | Authors: Alexander Apartsin, Yehudit Aperstein

Abstract

arXiv:2606.03650v1. Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists and standard public benchmarks cannot be trusted: their items have likely leaked into pretraining data, so scores reflect memorization rather than fitness for the task. We present CoEval, an open-source, reusable framework that closes this gap end to end. From only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels; the benchmark is contamination-free because items are generated anew on each run. A cross-family judge ensemble then ranks candidate models with no human raters. Validated where ground truth exists, CoEval recovers the true model ranking and tracks ground-truth correctness at ρ = 0.86.
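
The abstract does not state how judge scores are aggregated or which correlation coefficient ρ denotes. As a rough illustration only, the sketch below assumes a simple mean over judges and items, and assumes ρ is Spearman's rank correlation. All model names, judge names, scores, and function names are hypothetical and are not taken from CoEval.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def rank_models(judge_scores):
    """Aggregate {model: {judge: [item scores]}} into (model, score) pairs, best first.

    Assumption: each judge's item scores are averaged, then judges are averaged.
    """
    agg = {
        model: mean(mean(scores) for scores in judges.values())
        for model, judges in judge_scores.items()
    }
    return sorted(agg.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical judge scores on a 1-5 scale for three synthesized items.
judge_scores = {
    "model_a": {"judge_x": [4, 5, 4], "judge_y": [5, 4, 4]},
    "model_b": {"judge_x": [3, 3, 4], "judge_y": [3, 4, 3]},
    "model_c": {"judge_x": [2, 3, 2], "judge_y": [2, 2, 3]},
    "model_d": {"judge_x": [3, 4, 3], "judge_y": [2, 3, 3]},
}
# Hypothetical ground-truth accuracy, available only in a validation setting.
ground_truth = {"model_a": 0.81, "model_b": 0.64, "model_c": 0.52, "model_d": 0.70}

ranking = rank_models(judge_scores)
models = [m for m, _ in ranking]
rho = spearman([s for _, s in ranking], [ground_truth[m] for m in models])
print(models, round(rho, 2))
```

In this toy data the judges swap the order of two mid-ranked models relative to the ground truth, so the correlation falls below 1 while the best and worst models are still identified correctly.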
