Measuring the Depth of LLM Unlearning via Activation Patching
Event | Regulation | 2026-05-26 | Impact: Medium
arXiv:2605.24614v1

Abstract: Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxi
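The abstract's central point, that knowledge can look erased at the output while remaining recoverable from internal representations, can be illustrated with a generic activation-patching sketch. This is a toy example under stated assumptions, not the paper's procedure: the two-layer "models", their weights, and the helper names (`make_linear`, `forward`, `patched_output`) are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_linear(weights: Sequence[Sequence[float]]) -> Layer:
    """Return a bias-free linear layer computing weights @ x (hypothetical helper)."""
    def layer(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return layer


def forward(layers: Sequence[Layer], x: Vector,
            patch_at: Optional[int] = None,
            patch_value: Optional[Vector] = None,
            cache: Optional[List[Vector]] = None) -> Vector:
    """Run the layers in order; optionally overwrite one layer's output and record activations."""
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch_at == i and patch_value is not None:
            h = list(patch_value)  # replace this activation with the donor's
        if cache is not None:
            cache.append(list(h))
    return h


def patched_output(original: Sequence[Layer], unlearned: Sequence[Layer],
                   x: Vector, layer_idx: int) -> Vector:
    """Cache the unlearned model's activation at layer_idx, then patch it into the original model."""
    cache: List[Vector] = []
    forward(unlearned, x, cache=cache)
    return forward(original, x, patch_at=layer_idx, patch_value=cache[layer_idx])


# Toy setup (assumption): hidden dimension 0 carries the "target knowledge",
# and the final layer reads it out as a single score.
encode = make_linear([[1.0, 0.0], [0.0, 1.0]])
readout = make_linear([[1.0, 0.0]])
original = [encode, readout]

# "Shallow" unlearning: only the readout is suppressed; the hidden state is intact.
shallow = [encode, make_linear([[0.0, 0.0]])]
# "Deep" unlearning: the hidden representation itself no longer carries the signal.
deep = [make_linear([[0.0, 0.0], [0.0, 1.0]]), readout]

x = [1.0, 0.5]
print("output-level:", forward(original, x), forward(shallow, x), forward(deep, x))
print("patched at layer 0:", patched_output(original, shallow, x, 0),
      patched_output(original, deep, x, 0))
```

In this toy setting both unlearned variants give the same zero output, so an output-level metric cannot tell them apart. Patching the hidden activation into the original model restores the score only for the shallow variant, which is the kind of residual knowledge the abstract says output-level metrics miss.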
Source: arXiv cs.CL, 2026-05-26