First, do NOHARM: towards clinically safe large language models 文章
详细信息
- 来源站点
- ArXiv CS.AI
- 作者
- David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Vartan Pahalyants, Ernest Y. Lee, Allen Shih, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, Thomas A. Buckley, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Anastasia Perez, Austin J. Schoeffler, Mahbuba Tusty, Chase M. Walton, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Arjun K. Manrai, Adam Rodman, Jonathan H. Chen, Ethan Goh
- 文章类型
- NEWS
- 语言
- en
- 发布日期
- 2026-06-17
摘要
arXiv:2512.01241v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI.