Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

ArXiv CS.AI | 2026-05-29 | Authors: Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

Abstract

arXiv:2603.11331v3 (announce type: replace-cross). Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions.
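
The two regimes named in the abstract can be illustrated with a toy calculation. This is a minimal sketch under assumptions that are not taken from the paper: each context has a per-sample unsafe probability p, an attacker draws k independent samples, and the attack fails only if all k are safe, so the failure probability is E[(1 - p)^k]. If p follows a Beta(a, b) distribution with mass near zero, that expectation decays only polynomially, roughly as k^(-a). If every context has the same p0 > 0 (a hypothetical stand-in for the effect of prompt injection), it decays exponentially, as (1 - p0)^k. The function names and parameter values are illustrative only.

```python
import math


def failure_beta(k: int, a: float, b: float) -> float:
    """E[(1 - p)^k] for p ~ Beta(a, b), i.e. B(a, b + k) / B(a, b).

    Assumption (not from the paper): p is the per-sample unsafe
    probability of a randomly chosen context.
    """
    log_val = (math.lgamma(b + k) + math.lgamma(a + b)
               - math.lgamma(a + b + k) - math.lgamma(b))
    return math.exp(log_val)


def failure_fixed(k: int, p0: float) -> float:
    """Failure probability when every context has the same p0 > 0."""
    return (1.0 - p0) ** k


def loglog_slope(f, k1: int, k2: int) -> float:
    """Slope of log f(k) against log k between k1 and k2."""
    return (math.log(f(k2)) - math.log(f(k1))) / (math.log(k2) - math.log(k1))


if __name__ == "__main__":
    a, b, p0 = 0.5, 2.0, 0.05  # illustrative values only
    for k in (1, 10, 100, 1000):
        asr_poly = 1.0 - failure_beta(k, a, b)
        asr_exp = 1.0 - failure_fixed(k, p0)
        print(f"k={k:5d}  ASR(beta)={asr_poly:.4f}  ASR(fixed)={asr_exp:.4f}")
    # In the Beta case the log-log slope of the failure rate tends to -a.
    print("log-log slope:", loglog_slope(lambda k: failure_beta(k, a, b), 1000, 10000))
```

In this toy setting the heavy mass of hard-to-break contexts near p = 0 is what keeps the failure rate from falling faster than a power law; removing that mass is enough to switch to exponential decay. Whether this matches the paper's precise assumptions is not established by the abstract.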
