Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Abstract
A neural morpheme-boundary model for Turkish achieves lossless tokenization and morphology-aware embeddings with improved efficiency and performance over traditional subword methods.
Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
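To make the boundary-to-membership idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `soft_memberships` and `segment`, the 0.5 threshold, and the example probabilities are all assumptions for illustration. It uses the standard Poisson-binomial recursion, treating each gap between characters as an independent Bernoulli boundary, so that the morpheme index of a character is the number of boundaries before it.

```python
def soft_memberships(boundary_probs, max_segments=None):
    """Soft segment membership from per-gap boundary probabilities.

    boundary_probs[i] is the (assumed independent) probability of a
    boundary between character i and character i + 1, so a word of n
    characters has n - 1 entries. Returns n rows; rows[i][k] is the
    probability that character i falls in segment k, i.e. that exactly
    k boundaries occur before it (a Poisson-binomial distribution).
    """
    n = len(boundary_probs) + 1
    if max_segments is None:
        max_segments = n  # enough slots that no probability mass is dropped
    rows = [[0.0] * max_segments for _ in range(n)]
    rows[0][0] = 1.0  # the first character always opens segment 0
    for i in range(1, n):
        p = boundary_probs[i - 1]
        for k in range(max_segments):
            stay = rows[i - 1][k] * (1.0 - p)  # no boundary: same segment
            move = rows[i - 1][k - 1] * p if k > 0 else 0.0  # boundary: next segment
            rows[i][k] = stay + move
    return rows


def segment(word, boundary_probs, threshold=0.5):
    """Hard segmentation: cut wherever the boundary probability exceeds threshold.

    The pieces are slices of the untouched input string, so joining them
    always reproduces the word exactly.
    """
    assert len(boundary_probs) == max(len(word) - 1, 0)
    pieces, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces


# Hypothetical boundary probabilities for "evlerde" (ev + ler + de).
word = "evlerde"
probs = [0.05, 0.9, 0.1, 0.05, 0.95, 0.1]
pieces = segment(word, probs)
print(pieces)                    # ['ev', 'ler', 'de']
print("".join(pieces) == word)   # True
```

Because `segment` only slices the input and never normalizes it, the round-trip property follows directly from how the pieces are built. The soft version uses only multiplications and additions, so under these assumptions it would remain differentiable if rewritten with tensors in an autodiff framework.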