Safety Alignment of LMs via Non-cooperative Games
Event: PRODUCT_LAUNCH, 2026-06-02, Impact: MEDIUM
arXiv:2512.20806v3 Announce Type: replace
Abstract: Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained join
Related coverage
Safety Alignment of LMs via Non-cooperative Games
ArXiv CS.AI, 2026-06-02