Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
Event: PRODUCT_LAUNCH | 2026-06-02 | Impact: MEDIUM
arXiv:2606.00093v1 Announce Type: new
Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. Fo
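To make the statistics named in the abstract concrete, here is a minimal Python sketch that computes accuracy, precision, recall, $F_1$, and Cohen's $\kappa$ for a binary judgment scale. It is an illustration only, not the paper's procedure: the function name `agreement_report`, the toy labels, and the policy of representing invalid judge outputs as `None` and dropping them (while reporting how many were dropped) are all assumptions made for this example.

```python
from collections import Counter


def agreement_report(human, judge, positive=1):
    """Agreement between human labels and LLM-judge labels on a binary scale.

    Assumption: an invalid or unparseable judge output is encoded as None.
    Such items are excluded from the statistics and counted separately.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")

    pairs = [(h, j) for h, j in zip(human, judge) if j is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid judge outputs to score")

    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    tn = n - tp - fp - fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    # and p_e is the agreement expected from the two raters' label marginals.
    h_counts = Counter(h for h, _ in pairs)
    j_counts = Counter(j for _, j in pairs)
    labels = set(h_counts) | set(j_counts)
    p_e = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    # Kappa is undefined when chance agreement is already 1.
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else float("nan")

    return {
        "n_scored": n,
        "n_invalid": len(human) - n,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy data (hypothetical): 1 = acceptable, 0 = not acceptable.
human = [1, 1, 0, 0, 1, 0, 1, 0]
judge = [1, 0, 0, 0, 1, 1, 1, None]  # last judge output was invalid
report = agreement_report(human, judge)
```

Returning `n_invalid` alongside the scores keeps the invalid-output policy visible. Counting invalid outputs as disagreements instead of dropping them would change every number in the report, which is the kind of unstated choice the abstract points to.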
Related coverage (1)
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
ArXiv CS.CL | 2026-06-02