Incremental BPE Tokenization 文章

ArXiv CS.CL2026-06-01NEWSen作者: Shenghu Jiang, Ruihao Gong

摘要

arXiv:2605.30813v1 Announce Type: new Abstract: We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs.

相关事件查看全部 (1)

Incremental BPE Tokenization
2026-06-01PRODUCT_LAUNCH影响: MEDIUM

相关人物

暂无数据