SindBERT, the Sailor: Charting the Seas of Turkish NLP

ArXiv CS.CL, 2026-06-02
Authors: Raphael Schmitt, Stefan Schweter

Abstract

arXiv:2510.21364v2

Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall.
