Training Deliberative Monitors for Black-Box Scheming Detection · Event

PRODUCT_LAUNCH · 2026-05-29 · Impact: MEDIUM

arXiv:2605.29601v1 · Announce Type: new

Abstract: As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable, or expensive in deployment. In this work, we study action
