---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
- transformers
license: mit
library_name: nedo-turkish-tokenizer
pipeline_tag: token-classification
---
# NedoTurkishTokenizer
**Turkish morphological tokenizer — TR-MMLU world record 92.64%**
NedoTurkishTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
## Model Details
| | |
|---|---|
| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
| **Language** | Turkish (`tr`) |
| **License** | MIT |
| **Benchmark** | TR-MMLU **92.64%** (world record) |
| **Morphological engine** | Zemberek NLP (bundled) |
---
## Use This Model
### Installation
```bash
pip install git+https://huggingface.co/Ethosoft/NedoTurkishTokenizer
```
> **Java is required** for Zemberek morphological analysis.
> If you get a Java error, install it first:
>
> | OS | Command |
> |---|---|
> | Ubuntu / Debian | `sudo apt install default-jre` |
> | Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
> | macOS | `brew install openjdk` |
> | Windows | `winget install Microsoft.OpenJDK.21` |
---
### With Transformers (`AutoTokenizer`)
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Ethosoft/NedoTurkishTokenizer", trust_remote_code=True)
out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
print(out["input_ids"]) # hash-stable int IDs
print(out["attention_mask"]) # [1, 1, 1, ...]
print(out["token_type_ids"]) # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
for t in out["morphological_tokens"]:
    print(t["token"], t["token_type"], t["morph_pos"])
```
**Batch tokenization:**
```python
out = tok(["Türkçe metin.", "Another sentence with code-switching."])
# out["input_ids"] -> list of lists
```
**Direct morphological tokenization:**
```python
tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
for t in tokens:
    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
    if t.get("_canonical"): print(f" [{t['_canonical']}]", end="")
    if t.get("_compound"): print(f" compound={t['_parts']}", end="")
    if t.get("_expansion"): print(f" -> {t['_expansion']}", end="")
    print()
```
---
### Standalone (without Transformers)
```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer
tok = NedoTurkishTokenizer()
# Single text
tokens = tok("İSTANBUL'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
# Batch (parallel, all CPUs)
results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)
# TR coverage stats
s = tok.stats(tokens)
print(f"TR%: {s['tr_pct']} Pure%: {s['pure_pct']}")
```
---
### Example Output
Input: `"İSTANBUL'da meeting'e katılamadım"`
| token | token_type | morph_pos | notes |
|---|---|---|---|
| `<uppercase_word>` | ROOT | 0 | ALL CAPS marker |
| ` istanbul` | ROOT | 0 | lowercased |
| `da` | SUFFIX | 1 | `-LOC` |
| ` meeting` | FOREIGN | 0 | not in the TDK dictionary |
| `e` | SUFFIX | 1 | `-DAT` |
| ` katılama` | ROOT | 0 | Zemberek validated |
| `dı` | SUFFIX | 1 | `-PST` `[PAST]` |
| `m` | SUFFIX | 2 | `-1SG` |
---
## Output Fields
Every token dict contains:
| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string — leading space means word-initial |
| `token_type` | `str` | Morphological type (see table below) |
| `morph_pos` | `int` | Position within word: `0`=root, `1`=1st suffix, `2`=2nd suffix... |
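The leading-space convention makes detokenization trivial: concatenating all token strings and stripping the initial space recovers the original word boundaries. A minimal sketch (the token dicts below are illustrative, not actual tokenizer output):

```python
# Sample token stream following the field conventions above
# (illustrative values, not real NedoTurkishTokenizer output).
tokens = [
    {"token": " kitap", "token_type": "ROOT", "morph_pos": 0},
    {"token": "lar", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " geldi", "token_type": "ROOT", "morph_pos": 0},
    {"token": ".", "token_type": "PUNCT", "morph_pos": 0},
]

def detokenize(tokens):
    """Concatenate tokens; a leading space marks a word boundary."""
    return "".join(t["token"] for t in tokens).strip()

print(detokenize(tokens))  # -> kitaplar geldi.
```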
### Token Types
| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish root word | `kitap`, `gel` |
| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
| `PUNCT` | Punctuation | `.`, `,`, `?` |
| `NUM` | Number | `3.5`, `%85` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
| `URL` | Web address | `https://...` |
| `MENTION` | @username | `@ethosoft` |
| `HASHTAG` | #topic | `#NLP` |
| `EMOJI` | Emoji | |
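Because every token carries a `token_type`, simple coverage statistics fall out of a counter over the stream. A sketch of computing a Turkish-coverage ratio by hand (illustrative token list; the standalone API's `tok.stats()` does this for you):

```python
from collections import Counter

# Illustrative token list, not real tokenizer output.
tokens = [
    {"token": " istanbul", "token_type": "ROOT"},
    {"token": "da", "token_type": "SUFFIX"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": ".", "token_type": "PUNCT"},
]

counts = Counter(t["token_type"] for t in tokens)
turkish = counts["ROOT"] + counts["SUFFIX"]   # morphologically analyzed
total = sum(counts.values())
print(f"TR coverage: {100 * turkish / total:.1f}%")  # -> TR coverage: 50.0%
```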
### Optional Metadata Fields
| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme: `"lar"/"ler"` -> `"PL"`, `"dan"/"den"` -> `"ABL"` |
| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, ... |
| `_foreign` | `True` — foreign root detected by TDK lookup |
| `_caps` | `True` — originally ALL CAPS word |
| `_domain` | `True` — medical / sports / tourism domain word |
| `_compound` | `True` — compound word (e.g. `başbakan`) |
| `_parts` | Compound parts: `["baş", "bakan"]` |
| `_expansion` | Acronym expansion: `"CMV"` -> `"Sitomegalovirüs"` |
| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`... |
| `_lemma` | Lemma from Zemberek: `"gelir"` -> `"gelmek"` (when verb) |
| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`...) |
| `_root_corrected` | `True` — phonetic root correction: `"gök"` -> `"göğüs"` |
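Since every metadata field is underscore-prefixed and optional, a small helper can pull out whatever extras a token happens to carry. A sketch with illustrative token dicts (not actual tokenizer output):

```python
# Illustrative tokens carrying some of the optional metadata fields.
tokens = [
    {"token": " başbakan", "token_type": "ROOT",
     "_compound": True, "_parts": ["baş", "bakan"]},
    {"token": "lar", "token_type": "SUFFIX", "_canonical": "PL"},
    {"token": ".", "token_type": "PUNCT"},
]

def metadata(token):
    """Return only the optional underscore-prefixed fields of a token."""
    return {k: v for k, v in token.items() if k.startswith("_")}

for t in tokens:
    if metadata(t):
        print(t["token"].strip(), metadata(t))
```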
---
## How It Works
NedoTurkishTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
| Fix | Problem | Solution |
|---|---|---|
| 1 | `İSTANBUL` -> 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
| 2 | `meeting'e` -> broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified -> SUFFIX |
| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
| 5 | Punctuation counted as BPE | Classify as PUNCT |
| 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
| 7 | Foreign words as BPE | TDK 76K+ word lookup -> FOREIGN ROOT |
| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
| 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`...) |
| 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
| 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
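The allomorph canonicalization of Fix 9 can be sketched in pure Python: vowel-harmony variants of a suffix are mapped onto one canonical morpheme before ID lookup, so `lar` and `ler` share a single ID. The suffix table and IDs below are illustrative, not the tokenizer's actual vocabulary:

```python
# Toy allomorph table: surface form -> canonical morpheme (illustrative).
CANONICAL = {
    "lar": "PL", "ler": "PL",
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",
    "da": "LOC", "de": "LOC", "ta": "LOC", "te": "LOC",
}
# Hypothetical vocabulary IDs for the canonical morphemes.
MORPH_IDS = {"PL": 1001, "ABL": 1002, "LOC": 1003}

def suffix_id(surface):
    """Map a surface suffix to the ID of its canonical morpheme."""
    canonical = CANONICAL.get(surface, surface)
    return MORPH_IDS.get(canonical)

print(suffix_id("lar"), suffix_id("ler"))  # -> 1001 1001 (same morpheme)
```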
---
## License
MIT (c) [Ethosoft](https://huggingface.co/Ethosoft)