Narrow Secret Loyalty Dodges Black-Box Audits 文章

ArXiv CS.AI2026-06-03NEWSen作者: Alfie Lamerton, Fabien Roger

详细信息

来源站点: ArXiv CS.AI
作者: Alfie Lamerton, Fabien Roger
文章类型: NEWS
语言: en
发布日期: 2026-06-03

摘要

arXiv:2605.06846v3 Announce Type: replace-cross Abstract: Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines.

Narrow Secret Loyalty Dodges Black-Box Audits 文章

详细信息

摘要

相关事件

相关公司

相关人物

相关产品查看全部 (1)

相关技术查看全部 (3)