Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning [Article]

ArXiv CS.CL | 2026-06-16 | NEWS | en | Authors: Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi

Details

Source site
ArXiv CS.CL
Authors
Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, Adam Mahdi
Article type
NEWS
Language
en
Publication date
2026-06-16

Abstract

arXiv:2510.17431v2. Announce type: replace.

Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find that RL-trained models inherit refusal reasoning by deflecting harmful requests into benign search queries, but this breaks down under a simple diagnostic trigger that elicits a search call before refusal can occur. Under this condition, RL models produce multi-step unsafe search actions and reasoning, reducing search query safety by up to 68.6% in Qwen and Llama models relative to their IT counterparts. The effect generalises across model families, scales, and RL algorithms. To understand why, we identify linear directions in the residual stream that control search query safety, and show that RL training progressively shifts search behaviour toward the harmful end of this direction.
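The abstract does not say how the linear directions are extracted. One common way to obtain such a direction is the difference between mean activations on two contrasting sets of inputs, and the sketch below illustrates that idea only. It is an assumption, not the authors' method: the function names are hypothetical, and the two-dimensional vectors are toy values standing in for residual-stream activations.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def safety_direction(harmful_acts, safe_acts):
    """Unit vector pointing from the safe mean to the harmful mean.

    Hypothetical helper: a difference-of-means direction, assumed here
    for illustration rather than taken from the paper.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_safe = mean_vector(safe_acts)
    diff = [h - s for h, s in zip(mu_harmful, mu_safe)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar projection of one activation onto the direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (made-up numbers, not measurements from any model).
harmful = [[2.0, 0.0], [4.0, 0.0]]
safe = [[0.0, 0.0], [0.0, 0.0]]

direction = safety_direction(harmful, safe)
score = project([5.0, 1.0], direction)
```

Under this sketch, a larger projection places an activation closer to the harmful end of the direction, which is one way to read the abstract's claim that RL training progressively shifts search behaviour along it.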
