AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens 文章

ArXiv CS.CL2026-05-28NEWSen作者: Tung-Ling Li, Yuhao Wu, Hongliang Liu

摘要

arXiv:2512.17375v2 Announce Type: replace-cross Abstract: LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from "No" to "Yes". These tokens are sampled from the judge's own next-token distribution at the response position, with no manual seed set and no gradient-based optimization. Our procedure, AdvJudge-Zero, reaches $>$90% ensemble false-positive rate on 22 of 24 (model, dataset) cells across six Qwen, Llama, and Gemma judges, versus 54-72% for the prior curated 10-token benchmark, and the discovered surface transfers cross-format to a 70B scalar reward model.

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (5)

相关技术查看全部 (2)