Abstract
arXiv:2510.10185v3
Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making.