Constitutional Black-Box Monitoring for Scheming in LLM Agents

arXiv cs.CL | 2026-06-02 | Authors: Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

Abstract

arXiv:2603.00829v2

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each.
