---
language:
  - tr
tags:
  - tokenizer
  - morphology
  - turkish
  - nlp
  - transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---

# NedoTurkishTokenizer

> Turkish morphological tokenizer with a TR-MMLU world record of 92.64%

NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by Zemberek NLP.
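To make "morphological units" concrete, here is a toy sketch (NOT the library's algorithm) that peels suffixes off a word with a hand-picked suffix list and a tiny root lexicon:

```python
# Toy illustration only: greedy right-to-left suffix stripping.
# The real tokenizer uses Zemberek analysis plus 260+ suffix patterns.
SUFFIXES = ["dan", "den", "lar", "ler", "imiz", "ımız"]

def toy_segment(word, root_lexicon):
    """Peel known suffixes off the right edge until a known root remains."""
    suffixes = []
    while word not in root_lexicon:
        for s in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s)
                word = word[: -len(s)]
                break
        else:
            break  # no suffix matched; give up
    return [word] + suffixes

print(toy_segment("kitaplarımızdan", {"kitap"}))
# ['kitap', 'lar', 'ımız', 'dan']
```

A BPE tokenizer would typically split the same word into statistically frequent fragments that cross morpheme boundaries.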

## Model Details

| | |
|---|---|
| **Developer** | Ethosoft |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU 92.64% (world record) |
| **Morphological engine** | Zemberek NLP (bundled) |

## Use This Model

### Installation

```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```

Java is required for Zemberek morphological analysis. If you get a Java error, install it first:

| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
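To confirm a Java runtime is actually on `PATH` before first use, a quick shell check (illustrative, not part of the package) is:

```shell
# Report the installed Java version, or a hint if none is found.
if command -v java >/dev/null 2>&1; then
    java -version
else
    echo "Java not found -- install a JRE first (see the table above)"
fi
```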

### With Transformers (AutoTokenizer)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)

out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"])            # hash-stable int IDs
print(out["attention_mask"])       # [1, 1, 1, ...]
print(out["token_type_ids"])       # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social

for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```

Batch tokenization:

```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"]  -> list of lists
```

Direct morphological tokenization:

```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"):   print(f"  [{t['_canonical']}]", end="")
    if t.get("_compound"):    print(f"  compound={t['_parts']}", end="")
    if t.get("_expansion"):   print(f"  -> {t['_expansion']}", end="")
    print()
```

### Standalone (without Transformers)

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])

# Batch (parallel; worker count configurable)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)

# Turkish-coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']}  Pure%: {s['pure_pct']}")
```

## Example Output

Input: `"İSTANBUL'da meeting'e katılamadım"`

| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL-CAPS marker |
| `istanbul` | ROOT | 0 | lowercased |
| `da` | SUFFIX | 1 | -LOC |
| `meeting` | FOREIGN | 0 | not in TDK dictionary |
| `e` | SUFFIX | 1 | -DAT |
| `katılama` | ROOT | 0 | Zemberek validated |
| `dı` | SUFFIX | 1 | -PST [PAST] |
| `m` | SUFFIX | 2 | -1SG |

## Output Fields

Every token dict contains:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string; a leading space means word-initial |
| `token_type` | `str` | Morphological type (see table below) |
| `morph_pos` | `int` | Position within the word: 0 = root, 1 = first suffix, 2 = second suffix, ... |
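A minimal sketch of how the leading-space convention lets you rebuild the surface text (the token dicts below are hand-written sample data in the documented shape, not real tokenizer output):

```python
# Hand-written sample tokens: a leading space marks a word-initial token.
tokens = [
    {"token": " kitap", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "lar",    "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " geldi", "token_type": "ROOT",   "morph_pos": 0},
]

def detokenize(tokens):
    """Concatenate token strings; leading spaces restore word boundaries."""
    return "".join(t["token"] for t in tokens).strip()

print(detokenize(tokens))  # kitaplar geldi
```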

## Token Types

| Type | Description | Example |
|---|---|---|
| ROOT | Turkish root word | `kitap`, `gel` |
| SUFFIX | Turkish morphological suffix | `lar`, `da` |
| FOREIGN | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| BPE | Unknown subword (fallback) | rare/OOV fragments |
| PUNCT | Punctuation | `.`, `,`, `?` |
| NUM | Number | `3.5`, `%85` |
| DATE | Date | `14.03.2026` |
| UNIT | Measurement unit | `km`, `mg`, `TL` |
| URL | Web address | `https://...` |
| MENTION | @username | `@ethosoft` |
| HASHTAG | #topic | `#NLP` |
| EMOJI | Emoji | |
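For illustration only (not the library's internals), the non-morphological types above could be pre-classified with coarse regex rules like these:

```python
import re
from typing import Optional

# Illustrative rules mirroring the table above; patterns are simplified.
RULES = [
    ("URL",     re.compile(r"^https?://\S+$")),
    ("MENTION", re.compile(r"^@\w+$")),
    ("HASHTAG", re.compile(r"^#\w+$")),
    ("NUM",     re.compile(r"^%?\d+([.,]\d+)?$")),
    ("PUNCT",   re.compile(r"^[.,;:!?'\"()-]+$")),
]

def coarse_type(token: str) -> Optional[str]:
    """Return a non-morphological type, or None to fall through to analysis."""
    for name, pattern in RULES:
        if pattern.match(token):
            return name
    return None

print(coarse_type("@ethosoft"))  # MENTION
print(coarse_type("%85"))        # NUM
print(coarse_type("kitap"))      # None
```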

## Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"`/`"ler"` -> `PL`, `"dan"`/`"den"` -> `ABL` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` if a foreign root was detected via TDK lookup |
| `_caps` | `True` if the word was originally ALL CAPS |
| `_domain` | `True` for medical / sports / tourism domain words |
| `_compound` | `True` for compound words (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`, ... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when a verb) |
| `_disambiguated` | `True` if context disambiguation was applied (`"yüz"`, `"gelir"`, ...) |
| `_root_corrected` | `True` if a phonetic root correction was applied: `"gök"` -> `"göğüs"` |

## How It Works

NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with 12 sequential morphological fixes:

| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore an `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at the apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified as SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation and correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, and tourism roots in the vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN root |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` get different IDs for the same morpheme | Allomorph canonicalization (PL, ACC, DAT, ...) |
| 10 | `başbakan` as a single unknown ROOT | Compound-word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym-expansion dictionary (100+ entries) |
| 12 | `yüz` = 100, face, or swim? | Zemberek sentence-level context disambiguation |
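The allomorph canonicalization in Fix 9 can be sketched as a plain lookup table (a hand-written excerpt, not the library's actual mapping), so that vowel-harmony variants of one morpheme collapse to one canonical ID:

```python
# Excerpt of an allomorph table: surface suffix -> canonical morpheme tag.
CANONICAL = {
    "lar": "PL",  "ler": "PL",    # plural
    "dan": "ABL", "den": "ABL",   # ablative
    "da":  "LOC", "de":  "LOC",   # locative
    "ı": "ACC", "i": "ACC", "u": "ACC", "ü": "ACC",  # accusative
}

def canonicalize(suffix):
    """Map a surface suffix to its canonical morpheme; pass unknowns through."""
    return CANONICAL.get(suffix, suffix)

print(canonicalize("lar"), canonicalize("ler"))  # PL PL
print(canonicalize("dan"))                       # ABL
```

With this mapping, `kitaplar` and `evler` share one suffix ID instead of two.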

## License

MIT © Ethosoft