---
language:
  - tr
tags:
  - tokenizer
  - morphology
  - turkish
  - nlp
  - transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

> Turkish morphological tokenizer with a TR-MMLU world record of 92.64%

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by Zemberek NLP.
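To make "morphological units" concrete, here is a toy sketch (NOT the library's algorithm) that peels suffixes off a word with a hand-picked suffix list and a tiny root lexicon:

```python
# Toy illustration only: greedy right-to-left suffix stripping.
# The real tokenizer uses Zemberek analysis plus 260+ suffix patterns.
SUFFIXES = ["dan", "den", "lar", "ler", "imiz", "ımız"]

def toy_segment(word, root_lexicon):
    """Peel known suffixes off the right edge until a known root remains."""
    suffixes = []
    while word not in root_lexicon:
        for s in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s)
                word = word[: -len(s)]
                break
        else:
            break  # no suffix matched; give up
    return [word] + suffixes

print(toy_segment("kitaplarımızdan", {"kitap"}))
# ['kitap', 'lar', 'ımız', 'dan']
```

A BPE tokenizer would typically split the same word into statistically frequent fragments that cross morpheme boundaries.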

## Model Details

| | |
|---|---|
| **Developer** | Ethosoft |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU 92.64% (world record) |
| **Morphological engine** | Zemberek NLP (bundled) |

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

Java is required for Zemberek morphological analysis. If you get a Java error, install it first:

| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
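To confirm a Java runtime is actually on `PATH` before first use, a quick shell check (illustrative, not part of the package) is:

```shell
# Report the installed Java version, or a hint if none is found.
if command -v java >/dev/null 2>&1; then
    java -version
else
    echo "Java not found -- install a JRE first (see the table above)"
fi
```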

### With Transformers (AutoTokenizer)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])            # hash-stable int IDs
print(out["attention_mask"])       # [1, 1, 1, ...]
print(out["token_type_ids"])       # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

Batch tokenization:

```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"]  -> list of lists
```

Direct morphological tokenization:

```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):   print(f"  [{t['_canonical']}]", end="")
    if t.get("_compound"):    print(f"  compound={t['_parts']}", end="")
    if t.get("_expansion"):   print(f"  -> {t['_expansion']}", end="")
    print()
```

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel; worker count configurable)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# Turkish-coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']}  Pure%: {s['pure_pct']}")
```

## Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL-CAPS marker |
| `istanbul` | ROOT | 0 | lowercased |
| `da` | SUFFIX | 1 | -LOC |
| `meeting` | FOREIGN | 0 | not in TDK dictionary |
| `e` | SUFFIX | 1 | -DAT |
| `katılama` | ROOT | 0 | Zemberek validated |
| `dı` | SUFFIX | 1 | -PST [PAST] |
| `m` | SUFFIX | 2 | -1SG |

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string; a leading space means word-initial |
| `token_type` | `str` | Morphological type (see table below) |
| `morph_pos` | `int` | Position within the word: 0 = root, 1 = first suffix, 2 = second suffix, ... |
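A minimal sketch of how the leading-space convention lets you rebuild the surface text (the token dicts below are hand-written sample data in the documented shape, not real tokenizer output):

```python
# Hand-written sample tokens: a leading space marks a word-initial token.
tokens = [
    {"token": " kitap", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "lar",    "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " geldi", "token_type": "ROOT",   "morph_pos": 0},
]

def detokenize(tokens):
    """Concatenate token strings; leading spaces restore word boundaries."""
    return "".join(t["token"] for t in tokens).strip()

print(detokenize(tokens))  # kitaplar geldi
```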

## Token Types

| Type | Description | Example |
|---|---|---|
| ROOT | Turkish root word | `kitap`, `gel` |
| SUFFIX | Turkish morphological suffix | `lar`, `da` |
| FOREIGN | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| BPE | Unknown subword (fallback) | rare/OOV fragments |
| PUNCT | Punctuation | `.`, `,`, `?` |
| NUM | Number | `3.5`, `%85` |
| DATE | Date | `14.03.2026` |
| UNIT | Measurement unit | `km`, `mg`, `TL` |
| URL | Web address | `https://...` |
| MENTION | @username | `@ethosoft` |
| HASHTAG | #topic | `#NLP` |
| EMOJI | Emoji | |
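For illustration only (not the library's internals), the non-morphological types above could be pre-classified with coarse regex rules like these:

```python
import re
from typing import Optional

# Illustrative rules mirroring the table above; patterns are simplified.
RULES = [
    ("URL",     re.compile(r"^https?://\S+$")),
    ("MENTION", re.compile(r"^@\w+$")),
    ("HASHTAG", re.compile(r"^#\w+$")),
    ("NUM",     re.compile(r"^%?\d+([.,]\d+)?$")),
    ("PUNCT",   re.compile(r"^[.,;:!?'\"()-]+$")),
]

def coarse_type(token: str) -> Optional[str]:
    """Return a non-morphological type, or None to fall through to analysis."""
    for name, pattern in RULES:
        if pattern.match(token):
            return name
    return None

print(coarse_type("@ethosoft"))  # MENTION
print(coarse_type("%85"))        # NUM
print(coarse_type("kitap"))      # None
```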

## Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"`/`"ler"` -> `PL`, `"dan"`/`"den"` -> `ABL` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` if a foreign root was detected via TDK lookup |
| `_caps` | `True` if the word was originally ALL CAPS |
| `_domain` | `True` for medical / sports / tourism domain words |
| `_compound` | `True` for compound words (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`, ... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when a verb) |
| `_disambiguated` | `True` if context disambiguation was applied (`"yüz"`, `"gelir"`, ...) |
| `_root_corrected` | `True` if a phonetic root correction was applied: `"gök"` -> `"göğüs"` |

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with 12 sequential morphological fixes:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore an `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at the apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified as SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation and correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, and tourism roots in the vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN root |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` get different IDs for the same morpheme | Allomorph canonicalization (PL, ACC, DAT, ...) |
| 10 | `başbakan` as a single unknown ROOT | Compound-word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym-expansion dictionary (100+ entries) |
| 12 | `yüz` = 100, face, or swim? | Zemberek sentence-level context disambiguation |
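The allomorph canonicalization in Fix 9 can be sketched as a plain lookup table (a hand-written excerpt, not the library's actual mapping), so that vowel-harmony variants of one morpheme collapse to one canonical ID:

```python
# Excerpt of an allomorph table: surface suffix -> canonical morpheme tag.
CANONICAL = {
    "lar": "PL",  "ler": "PL",    # plural
    "dan": "ABL", "den": "ABL",   # ablative
    "da":  "LOC", "de":  "LOC",   # locative
    "ı": "ACC", "i": "ACC", "u": "ACC", "ü": "ACC",  # accusative
}

def canonicalize(suffix):
    """Map a surface suffix to its canonical morpheme; pass unknowns through."""
    return CANONICAL.get(suffix, suffix)

print(canonicalize("lar"), canonicalize("ler"))  # PL PL
print(canonicalize("dan"))                       # ABL
```

With this mapping, `kitaplar` and `evler` share one suffix ID instead of two.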

## License

MIT © Ethosoft