---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---
# NedoTurkishTokenizer
**Turkish morphological tokenizer — TR-MMLU world record 92.64%**
NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
## Model Details
| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | Zemberek NLP (bundled) |
---
## Use This Model
### Installation
```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```
> **Java is required** for Zemberek morphological analysis.
> If you get a Java error, install it first:
>
> | OS | Command |
> |---|---|
> | Ubuntu / Debian | `sudo apt install default-jre` |
> | Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
> | macOS | `brew install openjdk` |
> | Windows | `winget install Microsoft.OpenJDK.21` |
---
### With Transformers (`AutoTokenizer`)
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)
out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"]) # hash-stable int IDs
print(out["attention_mask"]) # [1, 1, 1, ...]
print(out["token_type_ids"]) # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```
**Batch tokenization:**
```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"] -> list of lists
```
**Direct morphological tokenization:**
```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"): print(f" [{t['_canonical']}]", end="")
    if t.get("_compound"): print(f" compound={t['_parts']}", end="")
    if t.get("_expansion"): print(f" -> {t['_expansion']}", end="")
    print()
```
---
### Standalone (without Transformers)
```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer
tok = NedoTurkishTokenizer()
# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
# Batch (parallel, all CPUs)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)
# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']} Pure%: {s['pure_pct']}")
```
---
### Example Output
Input: `"İSTANBUL'da meeting'e katılamadım"`
| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL CAPS marker |
| ` istanbul` | ROOT | 0 | lowercased |
| `da` | SUFFIX | 1 | `-LOC` |
| ` meeting` | FOREIGN | 0 | not in the TDK dictionary |
| `e` | SUFFIX | 1 | `-DAT` |
| ` katılama` | ROOT | 0 | Zemberek validated |
| `dı` | SUFFIX | 1 | `-PST` `[PAST]` |
| `m` | SUFFIX | 2 | `-1SG` |
---
## Output Fields
Every token dict contains:
| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string — leading space means word-initial |
| `token_type` | `str` | Morphological type (see table below) |
| `morph_pos` | `int` | Position within word: `0`=root, `1`=1st suffix, `2`=2nd suffix... |
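The leading-space convention makes detokenization trivial: concatenating all token strings and stripping the initial space recovers the original word boundaries. A minimal sketch (the token dicts below are illustrative, not actual tokenizer output):

```python
# Sample token stream following the field conventions above
# (illustrative values, not real NedoTurkishTokenizer output).
tokens = [
    {"token": " kitap", "token_type": "ROOT", "morph_pos": 0},
    {"token": "lar", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " geldi", "token_type": "ROOT", "morph_pos": 0},
    {"token": ".", "token_type": "PUNCT", "morph_pos": 0},
]

def detokenize(tokens):
    """Concatenate tokens; a leading space marks a word boundary."""
    return "".join(t["token"] for t in tokens).strip()

print(detokenize(tokens))  # -> kitaplar geldi.
```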
### Token Types
| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |
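Because every token carries a `token_type`, simple coverage statistics fall out of a counter over the stream. A sketch of computing a Turkish-coverage ratio by hand (illustrative token list; the standalone API's `tok.stats()` does this for you):

```python
from collections import Counter

# Illustrative token list, not real tokenizer output.
tokens = [
    {"token": " istanbul", "token_type": "ROOT"},
    {"token": "da", "token_type": "SUFFIX"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": ".", "token_type": "PUNCT"},
]

counts = Counter(t["token_type"] for t in tokens)
turkish = counts["ROOT"] + counts["SUFFIX"]   # morphologically analyzed
total = sum(counts.values())
print(f"TR coverage: {100 * turkish / total:.1f}%")  # -> TR coverage: 50.0%
```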
### Optional Metadata Fields
| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"/"ler"` -> `"PL"`, `"dan"/"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` — foreign root detected by TDK lookup |
| `_caps` | `True` — originally ALL CAPS word |
| `_domain` | `True` — medical / sports / tourism domain word |
| `_compound` | `True` — compound word (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`...) |
| `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |
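Since every metadata field is underscore-prefixed and optional, a small helper can pull out whatever extras a token happens to carry. A sketch with illustrative token dicts (not actual tokenizer output):

```python
# Illustrative tokens carrying some of the optional metadata fields.
tokens = [
    {"token": " başbakan", "token_type": "ROOT",
     "_compound": True, "_parts": ["baş", "bakan"]},
    {"token": "lar", "token_type": "SUFFIX", "_canonical": "PL"},
    {"token": ".", "token_type": "PUNCT"},
]

def metadata(token):
    """Return only the optional underscore-prefixed fields of a token."""
    return {k: v for k, v in token.items() if k.startswith("_")}

for t in tokens:
    if metadata(t):
        print(t["token"].strip(), metadata(t))
```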
---
## How It Works
NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified -> SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN ROOT |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`...) |
| 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
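The allomorph canonicalization of Fix 9 can be sketched in pure Python: vowel-harmony variants of a suffix are mapped onto one canonical morpheme before ID lookup, so `lar` and `ler` share a single ID. The suffix table and IDs below are illustrative, not the tokenizer's actual vocabulary:

```python
# Toy allomorph table: surface form -> canonical morpheme (illustrative).
CANONICAL = {
    "lar": "PL", "ler": "PL",
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",
    "da": "LOC", "de": "LOC", "ta": "LOC", "te": "LOC",
}
# Hypothetical vocabulary IDs for the canonical morphemes.
MORPH_IDS = {"PL": 1001, "ABL": 1002, "LOC": 1003}

def suffix_id(surface):
    """Map a surface suffix to the ID of its canonical morpheme."""
    canonical = CANONICAL.get(surface, surface)
    return MORPH_IDS.get(canonical)

print(suffix_id("lar"), suffix_id("ler"))  # -> 1001 1001 (same morpheme)
```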
---
## License
MIT (c) [Ethosoft](https://huggingface.co/Ethosoft)