The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Event: PRODUCT_LAUNCH | Date: 2026-05-28 | Impact: MEDIUM
arXiv:2602.15515v2 (Announce Type: replace-cross)
Abstract: Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environme
Related coverage (1)
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
ArXiv CS.AI, 2026-05-28