Subliminal Learning Is Steering Vector Distillation

arXiv cs.AI · 2026-06-02
Authors: Camila Blank, Agam Bhatia, Senthooran Rajamanoharan, Arthur Conmy, Neel Nanda

Abstract

arXiv:2606.00995v1

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering.
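To make the terms in the abstract concrete, the following is a minimal toy sketch in plain Python, not the paper's method or code. It assumes a single fixed linear map `W` standing in for the rest of the model, a "teacher" whose activations are shifted by a steering vector, and a "student" that fits its own vector by gradient descent on the teacher's outputs. Alignment between the two vectors is measured with cosine similarity. All names here (`add_steering`, `distill_steering`, `W`, `teacher_vec`) are hypothetical and chosen only for illustration; real language models are nonlinear, and the abstract does not specify which layer, scale, or training setup is used.

```python
import math

def add_steering(activations, vector, scale=1.0):
    """Steering: add scale * vector to an activation vector, elementwise."""
    return [a + scale * v for a, v in zip(activations, vector)]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distill_steering(W, inputs, teacher_vec, steps=500, lr=0.1):
    """Fit a student vector s so that W(h + s) matches W(h + teacher_vec).

    Toy assumption: the student only learns an additive vector and shares
    the teacher's linear map W. The loss is mean squared error on outputs.
    """
    dim = len(teacher_vec)
    s = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for h in inputs:
            target = matvec(W, add_steering(h, teacher_vec))  # steered teacher output
            pred = matvec(W, add_steering(h, s))              # student output
            err = [p - t for p, t in zip(pred, target)]
            # Gradient of 0.5 * ||err||^2 with respect to s is W^T err.
            for j in range(dim):
                grad[j] += sum(W[i][j] * err[i] for i in range(len(W)))
        s = [sj - lr * g / len(inputs) for sj, g in zip(s, grad)]
    return s

# Hypothetical toy data: a 2x2 "model", three activation vectors, one teacher vector.
W = [[1.0, 0.5], [0.0, 1.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
teacher_vec = [2.0, -1.0]

student_vec = distill_steering(W, inputs, teacher_vec)
print("cosine(student, teacher) =", cosine(student_vec, teacher_vec))
```

In this linear toy the student's vector converges toward the teacher's because the output error depends only on the difference between the two vectors. Whether and how well this carries over to real models is the empirical question the paper addresses.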

