Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
PRODUCT_LAUNCH | 2026-06-09 | Impact: MEDIUM
arXiv:2606.08682v1 Announce Type: cross
Abstract: Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter update
Source: arXiv cs.AI, 2026-06-09