Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

arXiv cs.CL, 2026-06-03. Authors: Lisa Bouger, Théo Lasnier, Philippe Looubet Moundi, Yannick Teglia, Djamé Seddah

Abstract

arXiv:2606.03785v1 (announce type: new)

Backdoor attacks on Large Language Models (LLMs) are a growing security concern: a backdoored model can be made to generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors suppresses others, we introduce the Cross Activation Shift Distance, which quantifies the distance between the model changes induced by different training runs.
