Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning

ArXiv CS.CL | 2026-06-16 | NEWS | en | Authors: Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi

Details

Source site: ArXiv CS.CL
Authors: Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2510.17431v2 (announce type: replace)

Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find that RL-trained models inherit refusal reasoning by deflecting harmful requests into benign search queries, but this breaks down under a simple diagnostic trigger that elicits a search call before refusal can occur. Under this condition, RL models produce multi-step unsafe search actions and reasoning, reducing search query safety by up to 68.6% in Qwen and Llama models relative to their IT counterparts. The effect generalises across model families, scales, and RL algorithms. To understand why, we identify linear directions in the residual stream that control search query safety, and show that RL training progressively shifts search behaviour toward the harmful end of this direction.
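The abstract does not say how the linear direction is found. As a rough illustration only, the sketch below assumes a difference-of-means direction, one common choice in interpretability work, and uses synthetic vectors in place of real residual-stream activations. The function names `safety_direction` and `project` are hypothetical and do not come from the paper.

```python
import numpy as np

def safety_direction(safe_acts, harmful_acts):
    """Unit vector pointing from the mean safe activation to the mean harmful one.

    Assumption: difference of means; the paper may use a different method.
    """
    diff = harmful_acts.mean(axis=0) - safe_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project(acts, direction):
    """Scalar projection of each activation row onto the direction."""
    return acts @ direction

# Synthetic stand-ins for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[0] = 3.0  # harmful examples are shifted along one axis by construction

safe_acts = rng.normal(size=(200, d_model))
harmful_acts = rng.normal(size=(200, d_model)) + offset

direction = safety_direction(safe_acts, harmful_acts)
safe_score = project(safe_acts, direction).mean()
harmful_score = project(harmful_acts, direction).mean()
```

Under this assumption, a higher mean projection places a set of activations nearer the harmful end of the direction, so comparing scores across training checkpoints would be one way to track the kind of progressive shift the abstract describes.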
