摘要
arXiv:2506.00250v4 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 41 state-of-the-art models, including general-purpose, Persian, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-weight general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.
相关事件查看全部 (2)
相关人物
暂无数据