A Systematic Investigation of RL-Jailbreaking in LLMs

arXiv cs.AI | 2026-06-04 | Authors: Montaser Mohammedalamen, Kevin Roice, Reginald McLean, Alyssa Lefaivre Škopac

Abstract

arXiv:2605.07032v2 (announce type: replace-cross)

The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardening. Adversarial jailbreaking, the strategic manipulation of models to elicit harmful output, remains a primary threat to safe deployment. While Reinforcement Learning (RL) frames jailbreaking as a multi-step attack through sequential optimization, a mechanistic understanding of why the framework succeeds remains incomplete. To fill this gap, we present the first systematic decomposition of RL jailbreaking. We deconstruct the framework into problem formalization (reward function, action space, episode length) and algorithmic measures (RL algorithm, training data, reward shaping) to identify the structural determinants of adversarial success. Our results reveal that the RL-jailbreaker successfully compromised all targeted models and safeguards.
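The decomposition described in the abstract can be pictured as a small ablation grid over six factors. The sketch below is only an illustration under stated assumptions: the six factor names and their two groupings come from the abstract, but every option value, the `AblationConfig` class, and the one-factor-at-a-time scheme are hypothetical placeholders, not the authors' actual experimental setup.

```python
from dataclasses import dataclass, replace
from itertools import product

# Factor names are taken from the abstract. Every option value below is a
# hypothetical placeholder, not a setting reported by the authors.
FACTORS = {
    "reward_function": ["option_a", "option_b"],
    "action_space": ["option_a", "option_b"],
    "episode_length": [1, 5],
    "rl_algorithm": ["option_a", "option_b"],
    "training_data": ["option_a", "option_b"],
    "reward_shaping": [False, True],
}

# The two groups named in the abstract.
PROBLEM_FORMALIZATION = ("reward_function", "action_space", "episode_length")
ALGORITHMIC_MEASURES = ("rl_algorithm", "training_data", "reward_shaping")


@dataclass(frozen=True)
class AblationConfig:
    """One point in the (hypothetical) ablation grid."""
    reward_function: str
    action_space: str
    episode_length: int
    rl_algorithm: str
    training_data: str
    reward_shaping: bool


def full_grid():
    """Enumerate every combination of the factor options."""
    names = list(FACTORS)
    return [AblationConfig(**dict(zip(names, values)))
            for values in product(*FACTORS.values())]


def one_factor_ablations(baseline):
    """Yield (factor, config) pairs differing from baseline in exactly one factor."""
    for name, options in FACTORS.items():
        for value in options:
            if value != getattr(baseline, name):
                yield name, replace(baseline, **{name: value})


# Baseline: the first listed option for every factor (an arbitrary choice).
baseline = AblationConfig(**{name: opts[0] for name, opts in FACTORS.items()})
ablations = list(one_factor_ablations(baseline))
```

Varying one factor at a time keeps the number of runs linear in the number of options, whereas the full grid grows multiplicatively; which scheme the paper actually uses is not stated in the abstract.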
