BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

arXiv cs.CL, 2026-05-29. Author: Rohan Shravan

Abstract

arXiv:2605.29379v1

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.
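To make the first stage of the retrofit concrete, the following is a minimal sketch of what a script-prune crop over a byte-level vocabulary could look like. It is not the author's implementation. The abstract does not name the nine pruned writing systems, so the two Unicode ranges below (Thai and Hangul syllables) are purely illustrative placeholders. The names `OUT_OF_SCOPE_RANGES`, `should_prune`, and `prune_vocab`, the choice to keep tokens that are not complete UTF-8, and the renumbering of surviving token IDs are all assumptions made for this example.

```python
# Illustrative placeholders only: the abstract does not list the pruned scripts.
OUT_OF_SCOPE_RANGES = [
    (0x0E00, 0x0E7F),  # Thai block (hypothetical choice)
    (0xAC00, 0xD7A3),  # Hangul syllables (hypothetical choice)
]


def should_prune(token_bytes: bytes) -> bool:
    """Return True if the token is valid UTF-8 and contains an out-of-scope character."""
    try:
        text = token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Byte-level BPE vocabularies contain partial UTF-8 sequences.
        # This sketch keeps them; a real crop would need an explicit policy.
        return False
    return any(
        lo <= ord(ch) <= hi
        for ch in text
        for lo, hi in OUT_OF_SCOPE_RANGES
    )


def prune_vocab(vocab: dict[bytes, int]) -> dict[bytes, int]:
    """Drop out-of-scope tokens and renumber the survivors in their original order."""
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [tok for tok, _ in ordered if not should_prune(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}


# Toy vocabulary: ASCII, Thai, Devanagari, and an incomplete UTF-8 prefix.
toy_vocab = {
    b"hello": 0,
    "ไทย".encode("utf-8"): 1,
    "नमस्ते".encode("utf-8"): 2,
    b"\xe0\xa4": 3,
}
pruned = prune_vocab(toy_vocab)
```

In this toy run only the Thai token is removed; the ASCII token, the Devanagari token, and the incomplete byte prefix survive. The second stage, in which freed slots are reassigned across Brahmic Unicode blocks by linear-programming allocation, is not sketched here because the abstract gives no details of its objective or constraints.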
