Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

Event: PRODUCT_LAUNCH | Date: 2026-05-29 | Impact: MEDIUM

arXiv:2605.29800v1 Announce Type: new

Abstract: LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natur
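
To give a rough sense of how nine correlated judges could amount to about two effective votes, here is a minimal sketch based on the textbook design-effect formula for equally correlated voters, n_eff = n / (1 + (n - 1) * rho). This is an illustrative assumption, not necessarily the framework the paper develops; the function name `effective_votes` and the correlation value in the usage lines are hypothetical and are not taken from the abstract.

```python
def effective_votes(n_judges: int, mean_error_correlation: float) -> float:
    """Estimate how many independent votes a correlated panel is worth.

    Assumption (not from the paper): every pair of judges shares the same
    error correlation rho in [0, 1]. The variance of the panel mean is then
    sigma^2 / n * (1 + (n - 1) * rho), which gives
    n_eff = n / (1 + (n - 1) * rho).
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    if not 0.0 <= mean_error_correlation <= 1.0:
        raise ValueError("mean_error_correlation must lie in [0, 1]")
    return n_judges / (1.0 + (n_judges - 1) * mean_error_correlation)


# Fully independent judges: all nine votes count.
independent = effective_votes(9, 0.0)

# Perfectly correlated judges: the panel is worth a single vote.
identical = effective_votes(9, 1.0)

# Hypothetical correlation chosen so that nine judges yield two effective votes.
correlated = effective_votes(9, 0.4375)

print(independent, identical, correlated)
```

Under this simplified model, a pairwise error correlation a little above 0.4 is already enough to shrink a nine-judge panel to roughly two independent votes, which is the kind of gap from the independent-voting ideal the abstract describes.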