Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

ArXiv CS.CL | 2026-06-05 | Authors: Srimonti Dutta, Akshata Kishore Moharir

Abstract

arXiv:2606.05384v1 (announce type: cross)

LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering.
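The distinction the abstract draws between reversibility and net target-directed steering can be illustrated with a small sketch. The abstract does not give the paper's metric definitions, so everything below is an assumption made for illustration: the `Trial` record, its field names, and the two functions `reversal_rate` and `net_steering` are hypothetical, and the formulas are one plausible way to operationalize the two quantities, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One pairwise comparison with a follow-up turn (hypothetical record)."""
    initial: str  # judge's first verdict, "A" or "B"
    final: str    # judge's verdict after the post-decision follow-up
    target: str   # answer the follow-up argues for (assumed field)


def reversal_rate(trials: List[Trial]) -> float:
    """Fraction of trials where the verdict changed, in either direction."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.initial != t.final for t in trials) / len(trials)


def net_steering(trials: List[Trial]) -> float:
    """Flips toward the target minus flips away from it, per trial.

    Assumes targets are counterbalanced, i.e. assigned independently of
    the judge's initial verdict, so undirected instability cancels out.
    """
    if not trials:
        raise ValueError("need at least one trial")
    toward = sum(t.initial != t.target and t.final == t.target for t in trials)
    away = sum(t.initial == t.target and t.final != t.target for t in trials)
    return (toward - away) / len(trials)


# Toy data, invented for illustration only (not results from the paper).
trials = [
    Trial(initial="A", final="B", target="B"),  # flipped toward target
    Trial(initial="A", final="A", target="B"),  # held against target
    Trial(initial="B", final="A", target="B"),  # flipped away from target
    Trial(initial="B", final="B", target="B"),  # held with target
]
print(reversal_rate(trials), net_steering(trials))
```

In this toy data half of the verdicts flip, yet flips toward and away from the target cancel, which is the kind of case where a judge would count as reversible without being steered in a consistent direction.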
