SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

ArXiv CS.AI · 2026-05-29 · NEWS · en · Authors: Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits · Related Technologies