The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Event: PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM

arXiv:2602.15515v2 (announce type: replace-cross)

Abstract: Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environme