Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

ArXiv CS.CL | 2026-06-18 | NEWS | en | Author: Tolga Şakar

Details

Source site
ArXiv CS.CL
Author
Tolga Şakar
Article type
NEWS
Language
en
Publication date
2026-06-18

Abstract

arXiv:2606.18717v1 (announce type: new)

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding.
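The abstract does not give the details of the dynamic program, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes each character carries an independent probability that a morpheme boundary sits immediately before it; the morpheme index of a character is then the number of boundaries seen so far, and that count follows a Poisson-binomial distribution that a simple recurrence can compute. The function names (`soft_memberships`, `encode`, `decode`), the 0.5 threshold, and the boundary probabilities for the example word are all hypothetical. The recurrence uses only sums and products, so it would be differentiable if written in an autodiff framework; plain Python is used here for readability.

```python
def soft_memberships(p):
    """Poisson-binomial DP (illustrative sketch).

    p[i] is the assumed probability of a morpheme boundary immediately
    before character i (p[0] is 0 by convention). Returns rows where
    rows[i][k] is the probability that character i lies in morpheme k,
    i.e. that exactly k boundaries occur among positions 0..i.
    """
    n = len(p)
    dist = [1.0] + [0.0] * n  # before any character: zero boundaries
    rows = []
    for p_i in p:
        new = [0.0] * (n + 1)
        for k in range(n + 1):
            new[k] += dist[k] * (1.0 - p_i)   # no boundary here
            if k + 1 <= n:
                new[k + 1] += dist[k] * p_i   # boundary: next morpheme
        dist = new
        rows.append(dist)
    return rows


def encode(word, p, threshold=0.5):
    """Hard segmentation by thresholding (an assumption of this sketch).

    Segments are slices of the untouched input string, so nothing is
    normalized or dropped.
    """
    segments, start = [], 0
    for i in range(1, len(word)):
        if p[i] > threshold:
            segments.append(word[start:i])
            start = i
    segments.append(word[start:])
    return segments


def decode(segments):
    """Concatenate segments back into the original string."""
    return "".join(segments)


# Made-up boundary probabilities for "evlerde" (ev + ler + de).
word = "evlerde"
probs = [0.0, 0.1, 0.9, 0.1, 0.1, 0.8, 0.1]
memberships = soft_memberships(probs)
segments = encode(word, probs)
```

Because `encode` only slices the input and `decode` only concatenates, the round trip returns the input string for any boundary probabilities, which is the sense in which such a tokenizer is lossless by construction.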
