Tokenization with Split Trees
Event: PRODUCT_LAUNCH | Date: 2026-05-28 | Impact: MEDIUM
arXiv:2605.22705v2 (Announce Type: replace)
Abstract: We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path.
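The two stages the abstract describes (vocabulary-independent tree construction, then vocabulary-dependent descent) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state the greedy split criterion, so the scoring rule here (pick the split point whose two halves have the highest summed n-gram count, leftmost on ties) is an assumption made purely for illustration. The names `Node`, `build_split_tree`, and `tokenize` are hypothetical, as is the fallback of emitting a single byte when a leaf is not in the vocabulary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One node of a split tree: a byte span and, unless it is a leaf, two children."""
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_split_tree(pretoken: bytes, counts: Dict[bytes, int]) -> Node:
    """Recursively split a pretoken into a full binary tree.

    Uses only precomputed byte n-gram counts, never a vocabulary.
    ASSUMPTION: the split score is count(left) + count(right); the
    abstract does not specify the actual criterion.
    """
    node = Node(pretoken)
    if len(pretoken) <= 1:
        return node  # single bytes are leaves
    # max() returns the first maximal element, so ties go to the leftmost split.
    best = max(
        range(1, len(pretoken)),
        key=lambda i: counts.get(pretoken[:i], 0) + counts.get(pretoken[i:], 0),
    )
    node.left = build_split_tree(pretoken[:best], counts)
    node.right = build_split_tree(pretoken[best:], counts)
    return node


def tokenize(node: Node, vocab: Set[bytes]) -> List[bytes]:
    """Descend the tree and emit the first in-vocabulary node on each path.

    ASSUMPTION: a leaf (single byte) is emitted even if it is missing
    from the vocabulary, so the output always covers the whole pretoken.
    """
    if node.text in vocab or node.left is None:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Toy usage with made-up counts and vocabulary.
counts = {b"ab": 10, b"cd": 8}
tree = build_split_tree(b"abcd", counts)
print(tokenize(tree, {b"ab", b"c", b"d"}))  # [b'ab', b'c', b'd']
```

Because the tree depends only on the n-gram counts, it can be built once per pretoken and reused with any vocabulary; only the descent step changes when the vocabulary does.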