Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

ArXiv CS.CL | 2026-06-03 | NEWS | en | Author: Aizierjiang Aiersilan

Details

Source site: ArXiv CS.CL
Author: Aizierjiang Aiersilan
Article type: NEWS
Language: en
Publication date: 2026-06-03

Abstract

arXiv:2606.02628v1 (announce type: cross)

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three $7$B--$8$B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in $4$-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves $0.904$--$1.000$ AUROC on held-out splits, while sampling-based detectors do not exceed $0.541$ AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than $0.01$ AUROC.
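To make the idea of a linear probe scored by held-out AUROC concrete, here is a minimal sketch in pure Python. It is not the paper's pipeline: no model is loaded, and the "hidden states" are synthetic Gaussian vectors whose two classes are shifted along one fixed direction (an assumption made only for illustration). The function names `make_synthetic_states`, `train_linear_probe`, `score`, and `auroc`, and all hyperparameters, are hypothetical. The probe is plain logistic regression trained by batch gradient descent, which may differ from the probe used in the paper.

```python
import math
import random


def make_synthetic_states(n, dim, shift, rng):
    """Draw n fake 'hidden states'; the two classes are shifted in
    opposite directions along one fixed unit vector (an assumption)."""
    direction = [1.0 / math.sqrt(dim)] * dim  # hypothetical "truthfulness" axis
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear probe output: a single dot product plus bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(xs, ys, epochs=200, lr=0.1):
    """Logistic regression fitted with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(score(w, b, x)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
xs, ys = make_synthetic_states(n=400, dim=16, shift=2.0, rng=rng)
train_x, train_y = xs[:300], ys[:300]  # fit the probe here
test_x, test_y = xs[300:], ys[300:]    # report AUROC only on this held-out part

w, b = train_linear_probe(train_x, train_y)
test_auroc = auroc([score(w, b, x) for x in test_x], test_y)
print(f"held-out AUROC: {test_auroc:.3f}")
```

With real models, each row of `xs` would instead be a hidden-state vector taken from one chosen layer, and the labels would come from a hallucination benchmark. The numbers this toy script prints say nothing about the results reported in the abstract.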
