Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels 文章

ArXiv CS.CL2026-05-29NEWSen作者: Guneet Kohli

摘要

arXiv:2605.29800v1 Announce Type: new Abstract: LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets (each with 100 human annotations per item), we find that the 9 judges effectively provide only about 2 independent votes' worth of information. Roughly three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. The consequences are stark: the panel's actual accuracy falls 8-22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据