PortBERT: Navigating the Depths of Portuguese Language Models

arXiv cs.CL | 2026-06-02 | Authors: Raphael Scheible-Schmitt, Henry He, Armando B. Mendes

Abstract

arXiv:2606.02100v1

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. For Portuguese, most existing models focus on scale or accuracy, often neglecting training and deployment efficiency. In this work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch using fairseq on over 450 GB of deduplicated and filtered mC4 and OSCAR23 data from CulturaX, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models.
