Event: Training Deliberative Monitors for Black-Box Scheming Detection

PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM

arXiv:2605.29601v1 (announce type: new)

Abstract: As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable, or expensive in deployment. In this work, we study action