CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

Category: OPEN_SOURCE · Date: 2026-06-03 · Impact: MEDIUM

arXiv:2606.03650v1 · Announce Type: new

Abstract: Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists and standard public benchmarks cannot be trusted: their items have likely leaked into pretraining, so scores reflect memorization rather than fitness. We present CoEval, an open-source, reusable framework that closes this gap end to end
