SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

ArXiv CS.AI | 2026-05-29 | Authors: Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

Abstract

arXiv:2604.01473v3 (announce type: replace-cross)

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using anchored token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal.
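The grading idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the logit distribution over the numerical tokens is turned into a decision, so the expected-grade score, the threshold of 5.0, the convention that a higher grade means a more harmful query, and all function names below are assumptions made for illustration. The logits are toy values rather than the output of a real model.

```python
import math


def digit_distribution(logits, digit_token_ids):
    """Softmax restricted to the logits of the numerical tokens (e.g. '0'..'9')."""
    selected = [logits[t] for t in digit_token_ids]
    peak = max(selected)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in selected]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(probs):
    """Expected grade, where probs[i] is the probability of grade i."""
    return sum(grade * p for grade, p in enumerate(probs))


def is_flagged(logits, digit_token_ids, threshold=5.0):
    """Assumed decision rule: flag the query when the expected grade is high."""
    probs = digit_distribution(logits, digit_token_ids)
    return expected_grade(probs) >= threshold


# Toy vocabulary of 12 tokens; ids 2..11 are assumed to be the digits '0'..'9'.
DIGIT_IDS = list(range(2, 12))

harmful_logits = [0.0] * 12
harmful_logits[11] = 10.0  # mass concentrated on grade 9

benign_logits = [0.0] * 12
benign_logits[2] = 10.0  # mass concentrated on grade 0

print(is_flagged(harmful_logits, DIGIT_IDS))
print(is_flagged(benign_logits, DIGIT_IDS))
```

Because only one forward pass and a ten-way softmax are needed in this sketch, no text has to be sampled, which is consistent with the abstract's motivation of avoiding latency and generation randomness.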
