---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---

# TurkTokenizer

**Turkish morphological tokenizer — TR-MMLU world record 92%**

TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) that align with Turkish grammar.

## Installation

```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```

**Java is required** (for Zemberek morphological analysis):

| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
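
Whichever install route you take, you can confirm a Java runtime is visible on `PATH` before installing the package (a minimal check; Zemberek needs Java at run time):

```shell
# Check for a Java runtime; print its version if present
if command -v java >/dev/null 2>&1; then
  java -version
else
  echo "No Java runtime found - install one with the commands above" >&2
fi
```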

## Quick Start

```python
from turk_tokenizer import TurkTokenizer

tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")

for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```

Output:
```
<uppercase_word>  ROOT     0
 istanbul         ROOT     0
da                SUFFIX   1
 meeting          FOREIGN  0
e                 SUFFIX   1
 katılama         ROOT     0
dı                SUFFIX   1
m                 SUFFIX   2
```

## Output Fields

Each token is a dict with the following guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
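
The leading-space convention makes it straightforward to rebuild an approximate surface form. A minimal sketch (the token dicts below are hand-written to match the Quick Start output shape; apostrophes are not recoverable this way):

```python
def detokenize(tokens):
    """Rebuild an approximate surface string: a leading space marks a word-initial token."""
    return "".join(t["token"] for t in tokens).lstrip()

tokens = [
    {"token": " istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
]
print(detokenize(tokens))  # istanbulda meetinge
```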

### Token Types

| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |
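
Because every token carries a `token_type`, simple corpus-level breakdowns need no extra tooling. A sketch using standard-library `Counter` (the sample dicts are illustrative, not tokenizer output):

```python
from collections import Counter

def token_type_histogram(tokens):
    """Count occurrences of each token_type in a token list."""
    return Counter(t["token_type"] for t in tokens)

sample = [
    {"token": " istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": " meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
]
print(token_type_histogram(sample))  # Counter({'SUFFIX': 2, 'ROOT': 1, 'FOREIGN': 1})
```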

### Optional Metadata Fields

| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if medical/sports/tourism domain |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if root was corrected by Zemberek |
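
Since these fields are optional, read them with `dict.get` rather than indexing. A small sketch (the example token is hand-written for illustration):

```python
def describe_token(t):
    """Render a token with whichever optional metadata fields are present."""
    parts = [t["token"].strip(), t["token_type"]]
    if t.get("_canonical"):
        parts.append(f"canonical={t['_canonical']}")
    if t.get("_lemma"):
        parts.append(f"lemma={t['_lemma']}")
    if t.get("_foreign"):
        parts.append("foreign")
    return " | ".join(parts)

tok = {"token": "lar", "token_type": "SUFFIX", "morph_pos": 1, "_canonical": "PL"}
print(describe_token(tok))  # lar | SUFFIX | canonical=PL
```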

## Batch Tokenization

```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)
```

## Statistics

```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```
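
For intuition, a coverage percentage like `tr_pct` can be sketched as the share of tokens whose type counts as native. Note this is an assumption about which types count (the `TR_TYPES` set below is illustrative); the library's own `stats()` is authoritative:

```python
# Assumed set of token types counted toward TR coverage (illustrative only)
TR_TYPES = {"ROOT", "SUFFIX", "PUNCT", "NUM", "DATE", "UNIT"}

def tr_coverage(tokens):
    """Percentage of tokens whose type falls in TR_TYPES (sketch, not the library's metric)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t["token_type"] in TR_TYPES)
    return round(100 * hits / len(tokens), 1)

sample = [
    {"token": " türk", "token_type": "ROOT"},
    {"token": " meeting", "token_type": "FOREIGN"},
    {"token": "e", "token_type": "SUFFIX"},
    {"token": ".", "token_type": "PUNCT"},
]
print(tr_coverage(sample))  # 75.0
```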

## Morphological Fixes Applied

1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
5. **Punctuation** — classified as PUNCT (counted in TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — Zemberek sentence-level POS selection
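
Allomorph canonicalization (fix 9) amounts to mapping vowel/consonant-harmony variants of a suffix onto one canonical morpheme ID. A minimal sketch; the lookup table below is illustrative, not the library's internal data:

```python
# Illustrative allomorph -> canonical morpheme ID table (not the library's actual data)
ALLOMORPHS = {
    "lar": "PL", "ler": "PL",
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",
    "da": "LOC", "de": "LOC", "ta": "LOC", "te": "LOC",
}

def canonicalize(suffix):
    """Map a surface suffix to its canonical ID, passing unknown strings through."""
    return ALLOMORPHS.get(suffix, suffix)

print(canonicalize("ler"))  # PL
print(canonicalize("dan"))  # ABL
```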

## Benchmark

| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |

## License

MIT