The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP 文章

ArXiv CS.CL2026-06-03NEWSen作者: Henry He, Johann Frei, Raphael Schmitt

摘要

arXiv:2606.03250v1 Announce Type: new Abstract: Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks.

相关公司

暂无数据

相关人物

暂无数据