Measuring the Depth of LLM Unlearning via Activation Patching
Event

REGULATION 2026-05-26 Impact: MEDIUM

arXiv:2605.24614v1 Announce Type: new

Abstract: Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxi