Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

Source: ArXiv CS.CL | Published: 2026-06-16 | Type: NEWS | Language: en | Authors: Zaifu Zhan, Shuang Zhou, Rui Zhang

Abstract

arXiv:2606.15419v1

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA).

Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting.

Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.
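
The selection step described in the Method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Candidate`, `peer_review_select`, and `majority_vote` are hypothetical, each agent is modeled as a plain callable rather than a real LLM call, and two details are assumptions the abstract does not specify: that reviewers return a numeric score, and that an agent's own chain is excluded from its review and the remaining scores are averaged.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Candidate:
    agent: str       # name of the agent that produced this chain
    reasoning: str   # chain-of-thought text
    answer: str      # candidate answer, e.g. an option letter


# Hypothetical interfaces: in practice each would wrap an LLM call.
Generator = Callable[[str], Candidate]
Reviewer = Callable[[str, Candidate], float]


def generate_candidates(question: str,
                        generators: Dict[str, Generator]) -> List[Candidate]:
    """Each agent independently produces a reasoning chain and an answer."""
    return [generate(question) for generate in generators.values()]


def peer_review_select(question: str,
                       candidates: List[Candidate],
                       reviewers: Dict[str, Reviewer]) -> Candidate:
    """Return the candidate whose reasoning is rated highest by its peers.

    Assumption: self-review is skipped and peer scores are averaged.
    """
    if len(reviewers) < 2:
        raise ValueError("peer review needs at least two agents")

    def peer_score(candidate: Candidate) -> float:
        scores = [review(question, candidate)
                  for name, review in reviewers.items()
                  if name != candidate.agent]
        return mean(scores)

    return max(candidates, key=peer_score)


def majority_vote(candidates: List[Candidate]) -> str:
    """Baseline: the most frequent answer across the agents' chains."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]
```

The sketch also shows where the two strategies can diverge: majority voting counts only final answers, whereas peer review ranks the reasoning chains themselves, so a minority answer backed by the best-rated chain can still be selected.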