When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

Source: ArXiv CS.AI | Published: 2026-05-29 | Type: NEWS | Language: en
Authors: Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

Abstract

arXiv:2605.29025v1

Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input. We propose an Interpretive Audit Pipeline that treats multi-model disagreement as diagnostic of interpretive complexity and directs human review toward genuinely ambiguous public input. Analyzing 1,260 public comments on a federal USDA docket across four LLMs, we find that inter-model thematic divergence exceeds within-model prompt variation, and that an expert rubric suppresses deep interpretive disagreement without resolving it.
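
The idea of using multi-model disagreement to direct human review can be illustrated with a minimal sketch. This is not the paper's Interpretive Audit Pipeline: the abstract does not say how disagreement is measured, so the entropy score, the 1.0-bit threshold, the function names, and the example labels below are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels several models gave one comment.

    Assumed disagreement measure: 0 when all models agree, higher as
    the labels spread across more categories.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def flag_for_review(labels_by_comment, threshold=1.0):
    """Return ids of comments whose cross-model entropy meets the threshold.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return [
        comment_id
        for comment_id, labels in labels_by_comment.items()
        if label_entropy(labels) >= threshold
    ]


# Hypothetical theme labels from four models for three comments.
example = {
    "c1": ["cost", "cost", "cost", "cost"],        # full agreement
    "c2": ["cost", "cost", "health", "health"],    # two-way split
    "c3": ["cost", "health", "access", "equity"],  # every model differs
}

flagged = flag_for_review(example)
print(flagged)
```

In this toy input, the comment with unanimous labels is left alone, while the split and fully divergent comments are routed to a human reviewer. A real audit would likely need a measure that also accounts for near-synonymous theme labels, which plain entropy over label strings does not.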
