HaluNet: Learning Hallucination Risk from Internal Signals in LLM Question Answering

arXiv cs.CL, 2026-05-29
Authors: Chaodong Tong, Qi Zhang, Zhuojun Jiang, Lei Jiang, Yanbing Liu

Abstract

arXiv:2512.24562v2 (Announce Type: replace)

Large language models (LLMs) achieve strong question answering (QA) performance but can produce fluent answers that are unsupported by the available evidence. Existing hallucination detectors often rely on external verification, repeated sampling, or test-time judge calls, which can be costly for real-time QA. We propose HaluNet, a lightweight hallucination risk estimator that uses internal signals from a single model generation. HaluNet jointly models token likelihood, predictive entropy, and hidden-state information, allowing probabilistic, distributional, and semantic evidence to inform an answer-level risk score. It is trained with LLM-as-a-Judge labels as scalable weak supervision and evaluated with independent human and multi-judge assessments. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet improves answer-level risk ranking across in-domain and out-of-domain settings.
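The abstract names three kinds of internal evidence (token likelihood, predictive entropy, hidden states) that are combined into one answer-level risk score, trained on judge labels and assessed by risk ranking. The sketch below illustrates that general pipeline under stated assumptions. It assumes access to the per-step probability distributions and hidden states from a single generation. It summarizes them by simple averaging and mean-pooling and scores them with a plain logistic regression. It then measures ranking quality with AUROC. The function names (`answer_features`, `risk`, `train_scorer`, `auroc`), the pooling choices, the logistic head, and the toy numbers are all hypothetical stand-ins. They are not HaluNet's architecture, which the abstract does not specify.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def answer_features(token_dists: Sequence[Sequence[float]],
                    chosen_ids: Sequence[int],
                    hidden_states: Sequence[Sequence[float]]) -> Vector:
    """Summarize one generation into a fixed-size feature vector (hypothetical).

    token_dists[t]: the model's probability distribution at step t
    chosen_ids[t]: the token actually emitted at step t
    hidden_states[t]: a hidden vector from some chosen layer at step t
    """
    n = len(chosen_ids)
    # Probabilistic evidence: mean log-likelihood of the emitted tokens.
    mean_logp = sum(math.log(d[i]) for d, i in zip(token_dists, chosen_ids)) / n
    # Distributional evidence: mean entropy of the per-step distributions.
    mean_entropy = sum(-sum(p * math.log(p) for p in d if p > 0)
                       for d in token_dists) / n
    # Semantic evidence (assumed form): mean-pooled hidden state.
    dim = len(hidden_states[0])
    pooled = [sum(h[k] for h in hidden_states) / n for k in range(dim)]
    return [mean_logp, mean_entropy] + pooled


def risk(w: Vector, b: float, x: Vector) -> float:
    """Logistic stand-in for a learned risk head: estimated P(hallucinated | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_scorer(X: List[Vector], y: List[int],
                 lr: float = 0.5, epochs: int = 500) -> Tuple[Vector, float]:
    """Full-batch gradient descent on log loss, using weak (judge) labels."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = risk(w, b, x) - t  # gradient of log loss w.r.t. the logit
            for k, xk in enumerate(x):
                gw[k] += err * xk
            gb += err
        w = [wk - lr * g / len(X) for wk, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up generations over a 3-token vocabulary with 2-d hidden states;
# label 1 stands in for an LLM-as-a-Judge "hallucinated" verdict.
EXAMPLES = [
    ([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]], [0, 0], [[0.2, 0.1], [0.3, 0.0]], 0),
    ([[0.85, 0.1, 0.05], [0.95, 0.03, 0.02]], [0, 0], [[0.1, 0.2], [0.2, 0.1]], 0),
    ([[0.4, 0.3, 0.3], [0.35, 0.35, 0.3]], [1, 2], [[0.9, 0.7], [0.8, 0.9]], 1),
    ([[0.34, 0.33, 0.33], [0.5, 0.25, 0.25]], [2, 1], [[0.7, 0.8], [0.9, 0.6]], 1),
]
X = [answer_features(d, c, h) for d, c, h, _ in EXAMPLES]
y = [label for *_, label in EXAMPLES]
w, b = train_scorer(X, y)
scores = [risk(w, b, x) for x in X]
print("risk scores:", [round(s, 3) for s in scores])
print("AUROC:", auroc(scores, y))
```

Scoring the training examples here only demonstrates the mechanics. The evaluation the abstract describes uses independent human and multi-judge assessments rather than the weak labels used for training. A single-pass design like this needs no extra sampling or judge calls at inference time. That is the cost argument the abstract makes against sampling-based and judge-based detectors.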