Retrying vs Resampling in AI Control 文章

ArXiv CS.AI2026-05-26NEWSen作者: James Lucassen, Adam Kaufman

摘要

arXiv:2605.26047v1 Announce Type: new Abstract: AI coding scaffolds like Claude Code and Codex use \textit{retrying}: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study \textit{resampling}: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61\% to 71\% at a 0.3\% audit budget, at no cost to usefulness.

Retrying vs Resampling in AI Control 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (10)

相关技术