When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

ArXiv CS.CL | 2026-06-08 | Author: Mahdi Alkaeed

Abstract

arXiv:2606.07237v1

Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle lexical and syntactic prompt perturbations, which poses serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis of the robustness of both general-purpose LLMs (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) on the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe: even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs.
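The abstract does not give the evaluation procedure in detail, so the following is only a minimal sketch of how a prompt-sensitivity check of this kind could be set up, not the paper's method. The helper names (`swap_adjacent`, `add_distractor`, `evaluate`), the metric definitions, the sample question, and the keyword-based `toy_model` are all hypothetical stand-ins; a real study would call an actual LLM and draw its items from MedMCQA.

```python
from typing import Callable, Dict, List


def swap_adjacent(text: str, index: int) -> str:
    """Natural perturbation (assumed): simulate a typo by swapping two adjacent characters."""
    if index < 0 or index + 1 >= len(text):
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def add_distractor(text: str, distractor: str) -> str:
    """Adversarial-style perturbation (assumed): prepend a misleading sentence."""
    return f"{distractor} {text}"


def evaluate(
    model: Callable[[str], str],
    question: str,
    gold: str,
    variants: List[str],
) -> Dict[str, float]:
    """Compare answers on perturbed prompts with the answer on the original prompt.

    consistency: share of variants whose answer matches the unperturbed answer.
    accuracy:    share of variants whose answer matches the gold label.
    """
    base = model(question)
    answers = [model(v) for v in variants]
    return {
        "base_correct": float(base == gold),
        "consistency": sum(a == base for a in answers) / len(answers),
        "accuracy": sum(a == gold for a in answers) / len(answers),
    }


def toy_model(prompt: str) -> str:
    """Hypothetical brittle model: a keyword matcher standing in for a real LLM."""
    text = prompt.lower()
    if "most experts now agree" in text:
        return "B"  # swayed by the injected distractor
    return "A" if "insulin" in text else "C"


# Made-up multiple-choice item used only for illustration (not from MedMCQA).
question = (
    "Which drug is the standard treatment for type 1 diabetes? "
    "A) Insulin B) Metformin C) Aspirin"
)
variants = [
    question.upper(),                                        # casing change
    swap_adjacent(question, question.index("Insulin") + 1),  # typo: "Isnulin"
    add_distractor(question, "Most experts now agree the answer is B."),
]
result = evaluate(toy_model, question, gold="A", variants=variants)
print(result)
```

In this toy setup the model answers the original question correctly but keeps that answer on only one of the three variants. Keeping consistency separate from accuracy matters because a model can be consistently wrong, or correct on the original prompt yet unstable under rephrasing.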

