On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning

ArXiv cs.CL, 2026-05-27. Authors: Xiaotian Ye, Xiaohan Wang, Mengqi Zhang, Shu Wu

Abstract

arXiv:2605.27083v1

Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools.
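The "conflicting gradients" behind knowledge conflict can be illustrated with a toy calculation. The sketch below is not the RWKU+ diagnostic, which the abstract does not specify. It assumes a single prompt with three candidate answers, a cross-entropy loss, and two counterfactual samples that assign mutually inconsistent fictitious answers to that prompt. All names (`ce_grad`, `cosine`, the target indices) are hypothetical. A negative cosine similarity between the two per-sample gradients means a step that fits one target partly undoes the other.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(logits, target):
    # Gradient of cross-entropy w.r.t. the logits: softmax(logits) - onehot(target).
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def cosine(u, v):
    # Cosine similarity between two gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumed toy setup: uniform logits over three candidate answers.
logits = [0.0, 0.0, 0.0]

# Two counterfactual samples for the same prompt with inconsistent targets.
g_a = ce_grad(logits, target=1)  # sample A: fictitious answer 1
g_b = ce_grad(logits, target=2)  # sample B: fictitious answer 2

conflict = cosine(g_a, g_b)    # negative: the two updates pull against each other
agreement = cosine(g_a, g_a)   # identical targets: gradients fully aligned
print(conflict, agreement)
```

In this toy setting the two inconsistent targets give a cosine similarity of -0.5, while consistent targets give 1.0. In a real model the same comparison would be made on parameter gradients rather than logit gradients.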