Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs · Event

PRODUCT_LAUNCH · 2026-06-09 · Impact: MEDIUM

arXiv:2606.07963v1. Announce Type: new.

Abstract: Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) o
