arxiv:2606.18717

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Published on Jun 17

Submitted by Tolga Şakar on Jun 18

Lonewolf Research & Development

Authors:

Tolga Şakar

Abstract

A neural morpheme-boundary model for Turkish achieves lossless tokenization and morphology-aware embeddings with improved efficiency and performance over traditional subword methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and (in the case of WordPiece and rule-based analyzers) failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers, the only ones valid for generation, Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
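
The abstract does not spell out the dynamic program, so the following is only a minimal sketch of one plausible reading, not the Morpheus implementation. It assumes the model emits one boundary probability per gap between adjacent characters, and that a character's morpheme index equals the number of boundaries before it; that count follows a Poisson-binomial distribution. The function names (`soft_memberships`, `encode`, `decode`), the 0.5 threshold, and the example probabilities are all hypothetical.

```python
from typing import List


def soft_memberships(boundary_probs: List[float]) -> List[List[float]]:
    """Row i holds P(character i belongs to morpheme k) for k = 0..i.

    boundary_probs[j] is an assumed model output: the probability that a
    morpheme boundary sits between character j and character j + 1.
    The recurrence is the standard Poisson-binomial DP over boundary counts;
    it uses only sums and products, so it is differentiable in the inputs.
    """
    rows = [[1.0]]  # the first character always belongs to morpheme 0
    for p in boundary_probs:
        prev = rows[-1]
        cur = [0.0] * (len(prev) + 1)
        for k, mass in enumerate(prev):
            cur[k] += mass * (1.0 - p)  # no boundary: stay in morpheme k
            cur[k + 1] += mass * p      # boundary: advance to morpheme k + 1
        rows.append(cur)
    return rows


def encode(word: str, boundary_probs: List[float], threshold: float = 0.5) -> List[str]:
    """Cut the raw string at gaps whose probability exceeds the threshold.

    Segments are slices of the unmodified input (no normalization), so
    joining them back always reproduces the word.
    """
    if len(boundary_probs) != max(len(word) - 1, 0):
        raise ValueError("expected one probability per gap between characters")
    segments, start = [], 0
    for j, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:j + 1])
            start = j + 1
    segments.append(word[start:])
    return segments


def decode(segments: List[str]) -> str:
    """Inverse of encode: plain concatenation."""
    return "".join(segments)


# Made-up probabilities for "evlerde" (ev + ler + de), for illustration only.
word = "evlerde"
probs = [0.05, 0.9, 0.1, 0.05, 0.85, 0.1]
segments = encode(word, probs)           # ["ev", "ler", "de"]
memberships = soft_memberships(probs)    # one distribution per character
```

Under these assumptions the round trip `decode(encode(w, probs)) == w` holds for any probability vector, because the segments partition the original string. Each row of `memberships` sums to 1, and with hard 0/1 probabilities the rows collapse to one-hot vectors that match the inference-time segments.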
