Soro: A Lightweight Foundation Model and Chatbot for Tajik

arXiv cs.CL, 2026-05-29

Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shoymardonov, Shuhratjon Khalilbekov, Bonu Boboeva

Abstract

arXiv:2605.27379v2

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets.
