BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base
Event: PRODUCT_LAUNCH | Date: 2026-05-29 | Impact: MEDIUM
arXiv:2605.29379v1 (announce type: new)
Abstract: We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope
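The abstract is cut off before it names what is removed, so the following is only a rough sketch of what a script-based vocabulary crop could look like, not the authors' procedure. The function names `char_script` and `script_prune`, the toy vocabulary, and the banned script labels (`CJK`, `HANGUL`) are all hypothetical placeholders chosen for illustration. The only figures taken from the source are the vocabulary sizes, which imply that 68,947 tokens are dropped and that the target size is 2^17.

```python
import unicodedata

# Sizes quoted in the abstract; the difference is the number of tokens cropped.
SOURCE_VOCAB = 200_019
TARGET_VOCAB = 131_072            # equals 2**17
REMOVED = SOURCE_VOCAB - TARGET_VOCAB  # 68,947 tokens


def char_script(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # code points without a name, e.g. control characters
        return "UNKNOWN"


def script_prune(vocab: dict[bytes, int], banned: set[str]) -> dict[bytes, int]:
    """Drop tokens containing any character from a banned script.

    `vocab` maps token bytes to rank. Byte-level BPE tokens may be partial
    UTF-8 sequences, so undecodable bytes are ignored rather than rejected.
    Surviving tokens are re-ranked contiguously in their original order.
    """
    kept = []
    for tok, _rank in sorted(vocab.items(), key=lambda kv: kv[1]):
        text = tok.decode("utf-8", errors="ignore")
        if any(char_script(c) in banned for c in text):
            continue
        kept.append(tok)
    return {tok: i for i, tok in enumerate(kept)}


# Toy vocabulary (hypothetical): Latin, CJK, Devanagari, Hangul, one raw byte.
toy_vocab = {
    b"hello": 0,
    "中".encode("utf-8"): 1,
    "क".encode("utf-8"): 2,
    "한".encode("utf-8"): 3,
    b"\xe4": 4,
}
# The banned set is an illustrative assumption, not the paper's list.
pruned = script_prune(toy_vocab, banned={"CJK", "HANGUL"})
```

In a real byte-level BPE vocabulary, removing tokens also means dropping or repairing any merges that depend on them, which this sketch does not attempt.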
Source: arXiv cs.CL, 2026-05-29