Probing for Knowledge Attribution in Large Language Models

arXiv cs.CL | 2026-05-28 | Authors: Ivo Brink, Alexander Boer, Dennis Ulmer

Abstract

arXiv:2602.22787v2

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations, where the model misuses provided context, and factuality violations, where answers reflect errors in internal knowledge. Proper mitigation depends on knowing which source drives each answer. We study contributive attribution, i.e. the classification of the dominant knowledge source behind each output, and show that a simple linear probe trained on hidden representations can reliably identify it. We introduce AttriWiki, a self-supervised pipeline that automatically generates labelled training data by prompting models to recall withheld entities from memory or read them from context, without relying on knowledge conflicts. Probes trained on AttriWiki achieve up to 0.96 Macro-$F_1$ on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transfer to SQuAD and WebQuestions with 0.94-0.
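To make the idea of a linear probe on hidden representations concrete, the following is a minimal sketch and not the authors' implementation. It assumes the hidden states have already been extracted as fixed-size vectors; here they are replaced by synthetic Gaussian clusters, and every name (`make_synthetic_states`, `train_linear_probe`, the 0/1 label convention) is hypothetical. The probe is a plain logistic regression trained with SGD and scored with macro-F1.

```python
import math
import random


def make_synthetic_states(n_per_class, dim, shift, rng):
    """Hypothetical stand-in for hidden states: two Gaussian clusters.

    Label 0 = answer recalled from memory, label 1 = answer read from context.
    Real probes would use vectors taken from a model layer instead.
    """
    data = []
    for label in (0, 1):
        center = shift if label == 1 else -shift
        for _ in range(n_per_class):
            data.append(([rng.gauss(center, 1.0) for _ in range(dim)], label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, lr=0.1, epochs=20):
    """Fit a logistic-regression probe (weights w, bias b) with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * grad * x[i]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


rng = random.Random(0)
data = make_synthetic_states(n_per_class=200, dim=8, shift=1.0, rng=rng)
train, test = data[:300], data[300:]
w, b = train_linear_probe(train, dim=8)
preds = [predict(w, b, x) for x, _ in test]
score = macro_f1([y for _, y in test], preds)
print(f"macro-F1 on synthetic held-out states: {score:.2f}")
```

The synthetic clusters are well separated by construction, so the score printed here says nothing about real models; the sketch only illustrates the shape of the probing setup (vectors in, source label out, macro-F1 as the metric).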
