SLMJury: Can Small Language Models Judge as Well as Large Ones?

ArXiv CS.AI | 2026-06-09 | Authors: Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

Abstract

arXiv:2606.07810v1 (announce type: cross)

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges, quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%.
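The idea of a judge as a budget-conditioned function can be sketched roughly as follows for the closed-ended (binary correctness) setting. This is only an illustration under stated assumptions, not the SLMJury implementation: the `Item` fields, the `Judge` signature, the domain labels, the budget values, and `toy_judge` are all hypothetical, and the toy judge is a hard-coded stand-in for a model call whose behavior says nothing about real SLMs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """One response to be judged (hypothetical schema)."""
    domain: str       # e.g. "math" or "general"; labels are assumptions
    question: str
    response: str
    is_correct: bool  # gold label for the response


# Assumed shape of a judge: (question, response, token_budget) -> verdict.
Judge = Callable[[str, str, int], bool]


def accuracy_by_domain_and_budget(
    judge: Judge, items: Iterable[Item], budgets: Sequence[int]
) -> Dict[Tuple[str, int], float]:
    """Fraction of verdicts matching the gold label, per (domain, budget)."""
    hits: Dict[Tuple[str, int], int] = defaultdict(int)
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for item in items:
        for budget in budgets:
            verdict = judge(item.question, item.response, budget)
            key = (item.domain, budget)
            totals[key] += 1
            hits[key] += int(verdict == item.is_correct)
    return {key: hits[key] / totals[key] for key in totals}


def toy_judge(question: str, response: str, token_budget: int) -> bool:
    """Hard-coded stand-in for a model call; not real model behavior."""
    if token_budget < 50:
        # Pretend a quick verdict wrongly accepts one plausible answer.
        return response in {"4", "Paris", "Lyon"}
    return response in {"4", "Paris"}


items = [
    Item("math", "What is 2 + 2?", "4", True),
    Item("math", "What is 2 + 2?", "5", False),
    Item("general", "What is the capital of France?", "Paris", True),
    Item("general", "What is the capital of France?", "Lyon", False),
]

results = accuracy_by_domain_and_budget(toy_judge, items, budgets=[10, 512])
for (domain, budget), acc in sorted(results.items()):
    print(f"{domain:8s} budget={budget:4d} accuracy={acc:.2f}")
```

Grouping accuracy by both domain and token budget is what allows a domain-dependent comparison between quick verdicts and extended reasoning; in practice `toy_judge` would be replaced by a call to an actual judge model run under the given budget.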
