Beyond English Benchmarks: Clinical LLM Evaluation in Brazilian Portuguese

arXiv cs.AI, 2026-06-09. Authors: Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

Abstract

arXiv:2606.07853v1, Announce Type: cross

Large language models are transforming clinical decision support and its application in real-world scenarios. Yet most benchmarks are conducted in English, and cross-lingual evaluation is needed to address language gaps in global access. We introduce ClinicalBr, the first bilingual benchmark for clinical decision-making built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models (MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini) in both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.
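
To make the idea of a task-dependent gap concrete, the following is a minimal sketch of how a per-task English-minus-Portuguese difference could be computed from parallel evaluation records. It is not the authors' evaluation code: the record layout (`task`, `lang`, `score`), the function name `language_gap_by_task`, and the toy scores are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def language_gap_by_task(records):
    """Return {task: mean English score - mean Portuguese score}.

    Each record is a dict with hypothetical keys "task", "lang" ("pt" or
    "en"), and "score". Tasks lacking scores in either language are skipped.
    """
    scores = defaultdict(lambda: {"pt": [], "en": []})
    for record in records:
        scores[record["task"]][record["lang"]].append(record["score"])
    return {
        task: mean(by_lang["en"]) - mean(by_lang["pt"])
        for task, by_lang in scores.items()
        if by_lang["pt"] and by_lang["en"]
    }


# Toy values invented for illustration only (not taken from the paper).
toy_records = [
    {"task": "diagnosis_retrieval", "lang": "en", "score": 0.75},
    {"task": "diagnosis_retrieval", "lang": "pt", "score": 0.5},
    {"task": "treatment_planning", "lang": "en", "score": 0.5},
    {"task": "treatment_planning", "lang": "pt", "score": 0.5},
]

gaps = language_gap_by_task(toy_records)
for task, gap in gaps.items():
    print(f"{task}: {gap:+.2f}")
```

Reporting the gap per task rather than as one pooled number is what lets a positive gap on one task coexist with a zero or negative gap on another.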
