Safety Alignment of LMs via Non-cooperative Games · Event
PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM
arXiv:2512.20806v3 · Announce Type: replace
Abstract: Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained join
Related coverage
Safety Alignment of LMs via Non-cooperative Games
ArXiv CS.AI · 2026-06-02