Ethosoft
/

NedoTurkishTokenizer

@@ -6,152 +6,216 @@ tags:
 - morphology
 - turkish
 - nlp
 license: mit
 library_name: turk-tokenizer
 ---
 # TurkTokenizer
 **Turkish morphological tokenizer — TR-MMLU world record 92%**
-TurkTokenizer performs linguistically-aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.
-## Installation
 ```bash
 pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
 ```
-**Java is required** (for Zemberek morphological analysis):
-| OS | Command |
-|---|---|
-| Ubuntu / Debian | `sudo apt install default-jre` |
-| Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
-| macOS | `brew install openjdk` |
-| Windows | `winget install Microsoft.OpenJDK.21` |
-## Quick Start
-**Direct usage:**
 ```python
 from turk_tokenizer import TurkTokenizer
 tok = TurkTokenizer()
-tokens = tok("İstanbul'da meeting'e katılamadım")
 for t in tokens:
     print(t["token"], t["token_type"], t["morph_pos"])
 ```
-**HuggingFace AutoTokenizer:**
-```python
-from transformers import AutoTokenizer
-tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
-out = tok("İstanbul'da meeting'e katılamadım")
-out["input_ids"]             # hash-stable int IDs
-out["attention_mask"]        # [1, 1, 1, ...]
-out["token_type_ids"]        # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
-out["morphological_tokens"]  # full morphological dicts
-# Batch:
-out = tok(["Türkçe metin.", "Another sentence."])
-```
-Output:
-```
-<uppercase_word>  ROOT    0
- istanbul         ROOT    0
-da               SUFFIX  1
- meeting         FOREIGN 0
-e                SUFFIX  1
- katılama        ROOT    0
-dı               SUFFIX  1
-m                SUFFIX  2
-```
 ## Output Fields
-Each token is a dict with the following guaranteed fields:
 | Field | Type | Description |
 |---|---|---|
-| `token` | `str` | Token string (leading space = word-initial) |
-| `token_type` | `str` | See types below |
-| `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
 ### Token Types
-| Type | Description |
-|---|---|
-| `ROOT` | Turkish root word |
-| `SUFFIX` | Turkish morphological suffix |
-| `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
-| `BPE` | Unknown subword (fallback) |
-| `PUNCT` | Punctuation mark |
-| `NUM` | Number |
-| `DATE` | Date |
-| `UNIT` | Measurement unit |
-| `URL` | Web URL |
-| `MENTION` | @username |
-| `HASHTAG` | #topic |
-| `EMOJI` | Emoji |
 ### Optional Metadata Fields
 | Field | Description |
 |---|---|
-| `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
-| `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
-| `_foreign` | `True` if foreign root |
-| `_caps` | `True` if originally ALL CAPS |
-| `_domain` | `True` if medical/sports/tourism domain |
-| `_compound` | `True` if compound word |
-| `_parts` | Compound word parts |
-| `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
-| `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
-| `_lemma` | Lemma from Zemberek |
-| `_disambiguated` | `True` if context disambiguation was applied |
-| `_root_corrected` | `True` if root was corrected by Zemberek |
-## Batch Tokenization
-```python
-texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
-results = tok.batch_tokenize(texts, workers=4)
-```
-## Statistics
-```python
-tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
-s = tok.stats(tokens)
-print(f"TR coverage: {s['tr_pct']}%")
-```
-## Morphological Fixes Applied
-1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
-2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
-3. **BPE→SUFFIX** — 260+ suffix patterns reclassified
-4. **Zemberek root validation** — phonetic root correction (`"gök"` → `"göğüs"`)
-5. **Punctuation** — classified as PUNCT (counted in TR coverage)
-6. **Domain vocabulary** — 500+ medical/sports/tourism roots
-7. **TDK FOREIGN detection** — 76K+ Turkish words used as reference
-8. **Special token normalization** — NUM, DATE, URL, MENTION, HASHTAG, EMOJI
-9. **Allomorph canonicalization** — `"lar"/"ler"` → `PL`, `"dan"/"den"` → `ABL`
-10. **Compound decomposition** — `"başbakan"` → `["baş", "bakan"]`
-11. **Acronym expansion** — `"CMV"` → `"Sitomegalovirüs"`
-12. **Context disambiguation** — Zemberek sentence-level POS selection
 ## Benchmark
-| Benchmark | Score |
 |---|---|
-| TR-MMLU | **92%** (world record) |
 ## License
-MIT

 - morphology
 - turkish
 - nlp
+- transformers
 license: mit
 library_name: turk-tokenizer
+pipeline_tag: token-classification
 ---
 # TurkTokenizer
 **Turkish morphological tokenizer — TR-MMLU world record 92%**
+TurkTokenizer performs linguistically-aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
+## Model Details
+| | |
+|---|---|
+| **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
+| **Language** | Turkish (`tr`) |
+| **License** | MIT |
+| **Benchmark** | TR-MMLU **92%** (world record) |
+| **Morphological engine** | Zemberek NLP (bundled) |
+---
+## Use This Model
+### Installation
 ```bash
 pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
 ```
+> **Java is required** for Zemberek morphological analysis.
+> If you get a Java error, install it first:
+>
+> | OS | Command |
+> |---|---|
+> | Ubuntu / Debian | `sudo apt install default-jre` |
+> | Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
+> | macOS | `brew install openjdk` |
+> | Windows | `winget install Microsoft.OpenJDK.21` |
+---
+### With 🤗 Transformers (`AutoTokenizer`)
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
+out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
+print(out["input_ids"])            # hash-stable int IDs
+print(out["attention_mask"])       # [1, 1, 1, ...]
+print(out["token_type_ids"])       # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
+for t in out["morphological_tokens"]:
+    print(t["token"], t["token_type"], t["morph_pos"])
+```
+**Batch tokenization:**
+```python
+out = tok(["Türkçe metin.", "Another sentence with code-switching."])
+# out["input_ids"]  → list of lists
+```
+**Direct morphological tokenization:**
+```python
+tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
+for t in tokens:
+    print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
+    if t.get("_canonical"):   print(f"  [{t['_canonical']}]", end="")
+    if t.get("_compound"):    print(f"  compound={t['_parts']}", end="")
+    if t.get("_expansion"):   print(f"  → {t['_expansion']}", end="")
+    print()
+```
+---
+### Standalone (without Transformers)
 ```python
 from turk_tokenizer import TurkTokenizer
 tok = TurkTokenizer()
+# Single text
+tokens = tok("İSTANBUL'da meeting'e katılamadım")
 for t in tokens:
     print(t["token"], t["token_type"], t["morph_pos"])
+# Batch (parallel, all CPUs)
+results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)
+# TR coverage stats
+s = tok.stats(tokens)
+print(f"TR%: {s['tr_pct']}  Pure%: {s['pure_pct']}")
 ```
+---
+### Example Output
+Input: `"İSTANBUL'da meeting'e katılamadım"`
+| token | token_type | morph_pos | notes |
+|---|---|---|---|
+| `<uppercase_word>` | ROOT | 0 | ALL CAPS marker |
+| ` istanbul` | ROOT | 0 | lowercased |
+| `da` | SUFFIX | 1 | `-LOC` |
+| ` meeting` | FOREIGN | 0 | TDK'da yok |
+| `e` | SUFFIX | 1 | `-DAT` |
+| ` katılama` | ROOT | 0 | Zemberek validated |
+| `dı` | SUFFIX | 1 | `-PST` `[PAST]` |
+| `m` | SUFFIX | 2 | `-1SG` |
+---
 ## Output Fields
+Every token dict contains:
 | Field | Type | Description |
 |---|---|---|
+| `token` | `str` | Token string — leading space means word-initial |
+| `token_type` | `str` | Morphological type (see table below) |
+| `morph_pos` | `int` | Position within word: `0`=root, `1`=1st suffix, `2`=2nd suffix… |
 ### Token Types
+| Type | Description | Example |
+|---|---|---|
+| `ROOT` | Turkish root word | `kitap`, `gel` |
+| `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
+| `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
+| `BPE` | Unknown subword (fallback) | rare/OOV fragments |
+| `PUNCT` | Punctuation | `.`, `,`, `?` |
+| `NUM` | Number | `3.5`, `%85` |
+| `DATE` | Date | `14.03.2026` |
+| `UNIT` | Measurement unit | `km`, `mg`, `TL` |
+| `URL` | Web address | `https://...` |
+| `MENTION` | @username | `@ethosoft` |
+| `HASHTAG` | #topic | `#NLP` |
+| `EMOJI` | Emoji | `😊` |
 ### Optional Metadata Fields
 | Field | Description |
 |---|---|
+| `_canonical` | Canonical morpheme: `"lar"/"ler"` → `"PL"`, `"dan"/"den"` → `"ABL"` |
+| `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, … |
+| `_foreign` | `True` — foreign root detected by TDK lookup |
+| `_caps` | `True` — originally ALL CAPS word |
+| `_domain` | `True` — medical / sports / tourism domain word |
+| `_compound` | `True` — compound word (e.g. `başbakan`) |
+| `_parts` | Compound parts: `["baş", "bakan"]` |
+| `_expansion` | Acronym expansion: `"CMV"` → `"Sitomegalovirüs"` |
+| `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`… |
+| `_lemma` | Lemma from Zemberek: `"gelir"` → `"gelmek"` (when verb) |
+| `_disambiguated` | `True` — context disambiguation applied (`"yüz"`, `"gelir"`…) |
+| `_root_corrected` | `True` — phonetic root correction: `"gök"` → `"göğüs"` |
+---
+## How It Works
+TurkTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
+| Fix | Problem | Solution |
+|---|---|---|
+| 1 | `İSTANBUL` → 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
+| 2 | `meeting'e` → broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
+| 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified → SUFFIX |
+| 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
+| 5 | Punctuation counted as BPE | Classify as PUNCT |
+| 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
+| 7 | Foreign words as BPE | TDK 76K+ word lookup → FOREIGN ROOT |
+| 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
+| 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`…) |
+| 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
+| 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
+| 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
+---
 ## Benchmark
+| Model | TR-MMLU |
 |---|---|
+| GPT-4o | 78.3% |
+| Llama-3-70B | 74.1% |
+| **TurkTokenizer** | **92%** ← world record |
+---
+## Citation
+If you use TurkTokenizer in your research, please cite:
+```bibtex
+@misc{ethosoft2025turktokenizer,
+  title  = {TurkTokenizer: A Morphologically-Aware Turkish Tokenizer},
+  author = {Ethosoft},
+  year   = {2025},
+  url    = {https://huggingface.co/Ethosoft/turk-tokenizer}
+}
+```
+---
 ## License
+MIT © [Ethosoft](https://huggingface.co/Ethosoft)