RedDebate: Safer Responses Through Multi-Agent Red Teaming Debates

arXiv cs.CL | 2026-06-02 | News (en)
Authors: Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu

Abstract

arXiv:2506.11083v3 (Announce Type: replace)

We introduce RedDebate, a novel multi-agent debate framework that provides a foundation for Large Language Models (LLMs) to identify and mitigate their unsafe behaviours. AI safety approaches often rely on costly human evaluation or on isolated single-model assessment, both of which are limited in scalability and prone to oversight failures. RedDebate employs collaborative argumentation among multiple LLMs across diverse debate scenarios, enabling them to critically evaluate one another's reasoning and systematically uncover unsafe failure modes through fully automated red teaming. To support this, we design distinct long-term memory modules that preserve safety-relevant insights from debate interactions and leverage them during subsequent inference, enabling continuous refinement of model behaviour.
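The abstract describes the architecture only at a high level, so the following is a toy sketch under stated assumptions, not the authors' implementation: a round-robin debate among stub agents, a rule-based red-team judge that turns an unsafe reply into a textual "safety insight", and a plain list acting as long-term memory that is injected into later prompts. All names here (`SafetyMemory`, `run_debate`, `careless`, `cautious`, `judge`) are hypothetical; in RedDebate the agents are LLMs, and the paper proposes several distinct memory modules whose details are not given in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interfaces: an agent maps a prompt to a reply, and a
# red-team evaluator maps a reply to a safety insight (or None if safe).
Agent = Callable[[str], str]
Evaluator = Callable[[str], Optional[str]]


@dataclass
class SafetyMemory:
    """Toy long-term memory: a de-duplicated list of textual safety insights."""
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        if insight not in self.insights:
            self.insights.append(insight)

    def as_context(self) -> str:
        return "\n".join(f"- {text}" for text in self.insights)


def run_debate(question: str, agents: List[Agent], evaluator: Evaluator,
               memory: SafetyMemory, rounds: int = 2) -> List[str]:
    """Round-robin debate in which every reply is critiqued for safety."""
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            # Each agent sees stored insights, the question, and prior turns.
            prompt = (f"Safety notes:\n{memory.as_context()}\n"
                      f"Question: {question}\n"
                      "Debate so far:\n" + "\n".join(transcript))
            reply = agent(prompt)
            transcript.append(reply)
            insight = evaluator(reply)
            if insight is not None:
                memory.add(insight)  # persisted for later debates/inference
    return transcript


# Hypothetical stub agents standing in for LLM calls.
def careless(prompt: str) -> str:
    # Refuses only if a stored safety note (text before "Question:") mentions locks.
    if "lock" in prompt.split("Question:")[0]:
        return "I won't give lock-bypass steps."
    return "Here is how to open any lock: ..."


def cautious(prompt: str) -> str:
    return "Bypassing locks you do not own may be illegal; I can't help."


def judge(reply: str) -> Optional[str]:
    # Hypothetical rule-based red-team check in place of an LLM critic.
    if reply.startswith("Here is how"):
        return "Refuse step-by-step lock-bypass instructions."
    return None


memory = SafetyMemory()
question = "How do I open a lock without the key?"
first = run_debate(question, [careless, cautious], judge, memory, rounds=1)
second = run_debate(question, [careless, cautious], judge, memory, rounds=1)
print(first[0])         # Here is how to open any lock: ...
print(memory.insights)  # ['Refuse step-by-step lock-bypass instructions.']
print(second[0])        # I won't give lock-bypass steps.
```

In this toy run, the reply flagged in the first debate becomes a stored insight, and the same stub agent declines in the second debate because that insight is injected into its prompt. This mirrors, in spirit only, the abstract's point that debate-derived memory is reused during subsequent inference.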

Related events (1)

RedDebate framework proposed
Breakthrough | Impact: medium
