---
language:
- tr
tags:
- tokenizer
- morphology
- turkish
- nlp
license: mit
library_name: turk-tokenizer
---
# TurkTokenizer
**A Turkish morphological tokenizer — 92% on TR-MMLU (world record)**
TurkTokenizer performs linguistically aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.
## Installation
```bash
pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
```
**Java is required** (for Zemberek morphological analysis):
| OS | Command |
|---|---|
| Ubuntu / Debian | `sudo apt install default-jre` |
| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
| macOS | `brew install openjdk` |
| Windows | `winget install Microsoft.OpenJDK.21` |
## Quick Start
```python
from turk_tokenizer import TurkTokenizer
tok = TurkTokenizer()
tokens = tok("İstanbul'da meeting'e katılamadım")
for t in tokens:
    print(t["token"], t["token_type"], t["morph_pos"])
```
Output:
```
<uppercase_word> ROOT 0
istanbul ROOT 0
da SUFFIX 1
meeting FOREIGN 0
e SUFFIX 1
katılama ROOT 0
dı SUFFIX 1
m SUFFIX 2
```
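The token dicts above can be reassembled into surface text. Below is a minimal, lossy sketch in plain Python (no library needed) that joins each root with its suffixes via `morph_pos`; it assumes, based on the sample output above, that `<uppercase_word>` means "capitalize the next word". Note that apostrophes are not restored and Python's `upper()` is not Turkish-locale aware.

```python
def detokenize(tokens):
    """Rejoin token dicts into words: morph_pos == 0 starts a new word,
    morph_pos > 0 appends to the previous one. Lossy sketch."""
    words = []
    capitalize_next = False
    for t in tokens:
        if t["token"] == "<uppercase_word>":  # assumed capitalization marker
            capitalize_next = True
            continue
        if t["morph_pos"] == 0:
            word = t["token"]
            if capitalize_next:
                # "i".upper() is "I", not Turkish "İ" -- locale caveat
                word = word[0].upper() + word[1:]
                capitalize_next = False
            words.append(word)
        else:
            words[-1] += t["token"]
    return " ".join(words)

sample = [
    {"token": "<uppercase_word>", "token_type": "ROOT", "morph_pos": 0},
    {"token": "istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katılama", "token_type": "ROOT", "morph_pos": 0},
    {"token": "dı", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "m", "token_type": "SUFFIX", "morph_pos": 2},
]
print(detokenize(sample))  # → Istanbulda meetinge katılamadım
```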
## Output Fields
Each token is a dict with the following guaranteed fields:
| Field | Type | Description |
|---|---|---|
| `token` | `str` | Token string (leading space = word-initial) |
| `token_type` | `str` | See types below |
| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
### Token Types
| Type | Description |
|---|---|
| `ROOT` | Turkish root word |
| `SUFFIX` | Turkish morphological suffix |
| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
| `BPE` | Unknown subword (fallback) |
| `PUNCT` | Punctuation mark |
| `NUM` | Number |
| `DATE` | Date |
| `UNIT` | Measurement unit |
| `URL` | Web URL |
| `MENTION` | @username |
| `HASHTAG` | #topic |
| `EMOJI` | Emoji |
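Because every token carries a `token_type`, downstream filtering is a one-liner. A plain-Python sketch over the documented fields (sample data is illustrative):

```python
from collections import Counter

def type_histogram(tokens):
    """Count how many tokens of each type appear in the output."""
    return Counter(t["token_type"] for t in tokens)

sample = [
    {"token": "istanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "meeting", "token_type": "FOREIGN", "morph_pos": 0},
    {"token": "e", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": ".", "token_type": "PUNCT", "morph_pos": 0},
]
print(type_histogram(sample))
# e.g. keep only morphological material:
morph_only = [t for t in sample if t["token_type"] in {"ROOT", "SUFFIX"}]
```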
### Optional Metadata Fields
| Field | Description |
|---|---|
| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
| `_foreign` | `True` if foreign root |
| `_caps` | `True` if originally ALL CAPS |
| `_domain` | `True` if medical/sports/tourism domain |
| `_compound` | `True` if compound word |
| `_parts` | Compound word parts |
| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
| `_lemma` | Lemma from Zemberek |
| `_disambiguated` | `True` if context disambiguation was applied |
| `_root_corrected` | `True` if root was corrected by Zemberek |
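Since these fields are optional, read them with `dict.get()` and a default rather than direct indexing. A small sketch (field names from the table above; the sample values are illustrative):

```python
def describe(token):
    """Render a token plus whichever optional metadata it carries."""
    parts = [token["token"], token["token_type"]]
    if token.get("_canonical"):
        parts.append(f"canon={token['_canonical']}")
    if token.get("_foreign"):
        parts.append("foreign")
    if token.get("_lemma"):
        parts.append(f"lemma={token['_lemma']}")
    return " ".join(parts)

t = {"token": "lar", "token_type": "SUFFIX", "morph_pos": 1, "_canonical": "PL"}
print(describe(t))  # → lar SUFFIX canon=PL
```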
## Batch Tokenization
```python
texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
results = tok.batch_tokenize(texts, workers=4)  # tok from Quick Start
```
## Statistics
```python
tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
s = tok.stats(tokens)
print(f"TR coverage: {s['tr_pct']}%")
```
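For intuition, `tr_pct` can be approximated by hand from the token dicts. The sketch below assumes that every type except `FOREIGN` and the `BPE` fallback counts toward Turkish coverage (consistent with fix 5 in the list below, which counts punctuation); the library's actual `stats()` may weight types differently.

```python
# Assumed "covered" types; FOREIGN and BPE fallbacks are excluded.
TR_TYPES = {"ROOT", "SUFFIX", "PUNCT", "NUM", "DATE", "UNIT",
            "URL", "MENTION", "HASHTAG", "EMOJI"}

def tr_coverage(tokens):
    """Percentage of tokens whose type counts as Turkish-covered."""
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if t["token_type"] in TR_TYPES)
    return round(100.0 * covered / len(tokens), 1)

sample = [
    {"token": "türk", "token_type": "ROOT"},
    {"token": "ler", "token_type": "SUFFIX"},
    {"token": "meeting", "token_type": "FOREIGN"},
    {"token": ".", "token_type": "PUNCT"},
]
print(tr_coverage(sample))  # → 75.0
```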
## Morphological Fixes Applied
1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
5. **Punctuation** — classified as PUNCT (counted in TR coverage)
6. **Domain vocabulary** — 500+ medical/sports/tourism roots
7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
12. **Context disambiguation** — Zemberek sentence-level POS selection
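Fix 9 (allomorph canonicalization) boils down to a many-to-one mapping from vowel-harmony variants to a canonical morpheme ID. A minimal illustrative sketch; the table here is made up for the example and is not the library's actual data:

```python
# Illustrative allomorph table: vowel-harmony (and consonant-harmony)
# variants of a suffix map to one canonical ID.
ALLOMORPHS = {
    "lar": "PL", "ler": "PL",                                  # plural
    "dan": "ABL", "den": "ABL", "tan": "ABL", "ten": "ABL",    # ablative
    "da": "LOC", "de": "LOC", "ta": "LOC", "te": "LOC",        # locative
}

def canonicalize(suffix):
    """Return the canonical morpheme ID, or the suffix itself if unknown."""
    return ALLOMORPHS.get(suffix, suffix)

print(canonicalize("ler"))  # → PL
print(canonicalize("den"))  # → ABL
```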
## Benchmark
| Benchmark | Score |
|---|---|
| TR-MMLU | **92%** (world record) |
## License
MIT