PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges 文章

ArXiv CS.AI2026-06-01NEWSen作者: Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama

查看原文 →

关系图谱

详细信息

来源站点: ArXiv CS.AI
作者: Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama
文章类型: NEWS
语言: en
发布日期: 2026-06-01

原文

摘要

arXiv:2605.30803v1 Announce Type: new Abstract: LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be ``helpful and factual'' can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources no raw source is simultaneously reliable, preference-predictive, and adversarially robust; and high inter-rater agreement does not imply low exploitability.

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges 文章

详细信息

摘要

相关事件

相关公司

相关人物

相关产品查看全部 (1)

相关技术