Retrying vs Resampling in AI Control: Event

PRODUCT_LAUNCH · 2026-05-26 · Impact: MEDIUM

arXiv:2605.26047v1. Announce Type: new. Abstract: AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study
