nmstech commited on
Commit
47e9fd4
·
verified ·
1 Parent(s): 864ffd2

Update model card with Use This Model section

Browse files
Files changed (1) hide show
  1. README.md +158 -94
README.md CHANGED
@@ -6,152 +6,216 @@ tags:
6
  - morphology
7
  - turkish
8
  - nlp
 
9
  license: mit
10
  library_name: turk-tokenizer
 
11
  ---
12
 
13
  # TurkTokenizer
14
 
15
  **Turkish morphological tokenizer — TR-MMLU world record 92%**
16
 
17
- TurkTokenizer performs linguistically-aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar.
18
 
19
- ## Installation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
 
21
  ```bash
22
  pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
23
  ```
24
 
25
- **Java is required** (for Zemberek morphological analysis):
 
 
 
 
 
 
 
 
26
 
27
- | OS | Command |
28
- |---|---|
29
- | Ubuntu / Debian | `sudo apt install default-jre` |
30
- | Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
31
- | macOS | `brew install openjdk` |
32
- | Windows | `winget install Microsoft.OpenJDK.21` |
33
 
34
- ## Quick Start
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
35
 
36
- **Direct usage:**
37
  ```python
38
  from turk_tokenizer import TurkTokenizer
39
 
40
  tok = TurkTokenizer()
41
- tokens = tok("İstanbul'da meeting'e katılamadım")
42
 
 
 
43
  for t in tokens:
44
  print(t["token"], t["token_type"], t["morph_pos"])
 
 
 
 
 
 
 
45
  ```
46
 
47
- **HuggingFace AutoTokenizer:**
48
- ```python
49
- from transformers import AutoTokenizer
50
 
51
- tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
52
- out = tok("İstanbul'da meeting'e katılamadım")
53
 
54
- out["input_ids"] # hash-stable int IDs
55
- out["attention_mask"] # [1, 1, 1, ...]
56
- out["token_type_ids"] # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
57
- out["morphological_tokens"] # full morphological dicts
58
 
59
- # Batch:
60
- out = tok(["Türkçe metin.", "Another sentence."])
61
- ```
 
 
 
 
 
 
 
62
 
63
- Output:
64
- ```
65
- <uppercase_word> ROOT 0
66
- istanbul ROOT 0
67
- da SUFFIX 1
68
- meeting FOREIGN 0
69
- e SUFFIX 1
70
- katılama ROOT 0
71
- dı SUFFIX 1
72
- m SUFFIX 2
73
- ```
74
 
75
  ## Output Fields
76
 
77
- Each token is a dict with the following guaranteed fields:
78
 
79
  | Field | Type | Description |
80
  |---|---|---|
81
- | `token` | `str` | Token string (leading space = word-initial) |
82
- | `token_type` | `str` | See types below |
83
- | `morph_pos` | `int` | `0` = root/word-initial, `1` = first suffix, `2` = second… |
84
 
85
  ### Token Types
86
 
87
- | Type | Description |
88
- |---|---|
89
- | `ROOT` | Turkish root word |
90
- | `SUFFIX` | Turkish morphological suffix |
91
- | `FOREIGN` | Foreign/loanword root (e.g. "meeting", "zoom") |
92
- | `BPE` | Unknown subword (fallback) |
93
- | `PUNCT` | Punctuation mark |
94
- | `NUM` | Number |
95
- | `DATE` | Date |
96
- | `UNIT` | Measurement unit |
97
- | `URL` | Web URL |
98
- | `MENTION` | @username |
99
- | `HASHTAG` | #topic |
100
- | `EMOJI` | Emoji |
101
 
102
  ### Optional Metadata Fields
103
 
104
  | Field | Description |
105
  |---|---|
106
- | `_canonical` | Canonical morpheme ID (e.g. `"PL"`, `"ACC"`, `"DAT"`) |
107
- | `_suffix_label` | Detailed morphological label (e.g. `"-PL+ACC"`) |
108
- | `_foreign` | `True` if foreign root |
109
- | `_caps` | `True` if originally ALL CAPS |
110
- | `_domain` | `True` if medical/sports/tourism domain |
111
- | `_compound` | `True` if compound word |
112
- | `_parts` | Compound word parts |
113
- | `_expansion` | Acronym expansion (e.g. `"CMV"` → `"Sitomegalovirüs"`) |
114
- | `_pos` | POS tag from Zemberek (Noun, Verb, Adj…) |
115
- | `_lemma` | Lemma from Zemberek |
116
- | `_disambiguated` | `True` if context disambiguation was applied |
117
- | `_root_corrected` | `True` if root was corrected by Zemberek |
118
-
119
- ## Batch Tokenization
120
 
121
- ```python
122
- texts = ["Ankara'da kar yağıyor.", "Meeting'e katılacak mısın?"]
123
- results = tok.batch_tokenize(texts, workers=4)
124
- ```
125
 
126
- ## Statistics
127
 
128
- ```python
129
- tokens = tok("Türk dili zengin bir morfolojiye sahiptir.")
130
- s = tok.stats(tokens)
131
- print(f"TR coverage: {s['tr_pct']}%")
132
- ```
133
 
134
- ## Morphological Fixes Applied
135
-
136
- 1. **ALL CAPS** — `"İSTANBUL"` → 2 tokens instead of 16
137
- 2. **Apostrophe splitting** — `"meeting'e"` → `[meeting:FOREIGN][e:SUFFIX]`
138
- 3. **BPE→SUFFIX** 260+ suffix patterns reclassified
139
- 4. **Zemberek root validation** — phonetic root correction (`"gök"` `"göğüs"`)
140
- 5. **Punctuation** classified as PUNCT (counted in TR coverage)
141
- 6. **Domain vocabulary** 500+ medical/sports/tourism roots
142
- 7. **TDK FOREIGN detection** 76K+ Turkish words used as reference
143
- 8. **Special token normalization** NUM, DATE, URL, MENTION, HASHTAG, EMOJI
144
- 9. **Allomorph canonicalization** — `"lar"/"ler"` `PL`, `"dan"/"den"` `ABL`
145
- 10. **Compound decomposition** — `"başbakan"` `["baş", "bakan"]`
146
- 11. **Acronym expansion** — `"CMV"` `"Sitomegalovirüs"`
147
- 12. **Context disambiguation** Zemberek sentence-level POS selection
 
 
148
 
149
  ## Benchmark
150
 
151
- | Benchmark | Score |
152
  |---|---|
153
- | TR-MMLU | **92%** (world record) |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
154
 
155
  ## License
156
 
157
- MIT
 
6
  - morphology
7
  - turkish
8
  - nlp
9
+ - transformers
10
  license: mit
11
  library_name: turk-tokenizer
12
+ pipeline_tag: token-classification
13
  ---
14
 
15
  # TurkTokenizer
16
 
17
  **Turkish morphological tokenizer — TR-MMLU world record 92%**
18
 
19
+ TurkTokenizer performs linguistically-aware tokenization of Turkish text using morphological rules. Unlike BPE-based tokenizers, it produces meaningful morphological units (roots and suffixes) aligned with Turkish grammar, powered by [Zemberek NLP](https://github.com/ahmetaa/zemberek-nlp).
20
 
21
+ ## Model Details
22
+
23
+ | | |
24
+ |---|---|
25
+ | **Developer** | [Ethosoft](https://huggingface.co/Ethosoft) |
26
+ | **Language** | Turkish (`tr`) |
27
+ | **License** | MIT |
28
+ | **Benchmark** | TR-MMLU **92%** (world record) |
29
+ | **Morphological engine** | Zemberek NLP (bundled) |
30
+
31
+ ---
32
+
33
+ ## Use This Model
34
+
35
+ ### Installation
36
 
37
  ```bash
38
  pip install git+https://huggingface.co/Ethosoft/turk-tokenizer
39
  ```
40
 
41
+ > **Java is required** for Zemberek morphological analysis.
42
+ > If you get a Java error, install it first:
43
+ >
44
+ > | OS | Command |
45
+ > |---|---|
46
+ > | Ubuntu / Debian | `sudo apt install default-jre` |
47
+ > | Fedora / RHEL | `sudo dnf install java-latest-openjdk` |
48
+ > | macOS | `brew install openjdk` |
49
+ > | Windows | `winget install Microsoft.OpenJDK.21` |
50
 
51
+ ---
 
 
 
 
 
52
 
53
+ ### With 🤗 Transformers (`AutoTokenizer`)
54
+
55
+ ```python
56
+ from transformers import AutoTokenizer
57
+
58
+ tok = AutoTokenizer.from_pretrained("Ethosoft/turk-tokenizer", trust_remote_code=True)
59
+
60
+ out = tok("Türk dili, morfolojik açıdan zengin bir dildir.")
61
+ print(out["input_ids"]) # hash-stable int IDs
62
+ print(out["attention_mask"]) # [1, 1, 1, ...]
63
+ print(out["token_type_ids"]) # 0=root, 1=suffix, 2=bpe, 3=punct, 4=num, 5=url/social
64
+
65
+ for t in out["morphological_tokens"]:
66
+ print(t["token"], t["token_type"], t["morph_pos"])
67
+ ```
68
+
69
+ **Batch tokenization:**
70
+ ```python
71
+ out = tok(["Türkçe metin.", "Another sentence with code-switching."])
72
+ # out["input_ids"] → list of lists
73
+ ```
74
+
75
+ **Direct morphological tokenization:**
76
+ ```python
77
+ tokens = tok.morphological_tokenize("Başbakan Ankara'da toplantı yaptı.")
78
+ for t in tokens:
79
+ print(f"{t['token']:20s} {t['token_type']:8s} pos={t['morph_pos']}", end="")
80
+ if t.get("_canonical"): print(f" [{t['_canonical']}]", end="")
81
+ if t.get("_compound"): print(f" compound={t['_parts']}", end="")
82
+ if t.get("_expansion"): print(f" → {t['_expansion']}", end="")
83
+ print()
84
+ ```
85
+
86
+ ---
87
+
88
+ ### Standalone (without Transformers)
89
 
 
90
  ```python
91
  from turk_tokenizer import TurkTokenizer
92
 
93
  tok = TurkTokenizer()
 
94
 
95
+ # Single text
96
+ tokens = tok("İSTANBUL'da meeting'e katılamadım")
97
  for t in tokens:
98
  print(t["token"], t["token_type"], t["morph_pos"])
99
+
100
+ # Batch (parallel, all CPUs)
101
+ results = tok.batch_tokenize(["metin 1", "metin 2", "metin 3"], workers=4)
102
+
103
+ # TR coverage stats
104
+ s = tok.stats(tokens)
105
+ print(f"TR%: {s['tr_pct']} Pure%: {s['pure_pct']}")
106
  ```
107
 
108
+ ---
 
 
109
 
110
+ ### Example Output
 
111
 
112
+ Input: `"İSTANBUL'da meeting'e katılamadım"`
 
 
 
113
 
114
+ | token | token_type | morph_pos | notes |
115
+ |---|---|---|---|
116
+ | `<uppercase_word>` | ROOT | 0 | ALL CAPS marker |
117
+ | ` istanbul` | ROOT | 0 | lowercased |
118
+ | `da` | SUFFIX | 1 | `-LOC` |
119
+ | ` meeting` | FOREIGN | 0 | TDK'da yok |
120
+ | `e` | SUFFIX | 1 | `-DAT` |
121
+ | ` katılama` | ROOT | 0 | Zemberek validated |
122
+ | `dı` | SUFFIX | 1 | `-PST` `[PAST]` |
123
+ | `m` | SUFFIX | 2 | `-1SG` |
124
 
125
+ ---
 
 
 
 
 
 
 
 
 
 
126
 
127
  ## Output Fields
128
 
129
+ Every token dict contains:
130
 
131
  | Field | Type | Description |
132
  |---|---|---|
133
+ | `token` | `str` | Token string leading space means word-initial |
134
+ | `token_type` | `str` | Morphological type (see table below) |
135
+ | `morph_pos` | `int` | Position within word: `0`=root, `1`=1st suffix, `2`=2nd suffix… |
136
 
137
  ### Token Types
138
 
139
+ | Type | Description | Example |
140
+ |---|---|---|
141
+ | `ROOT` | Turkish root word | `kitap`, `gel` |
142
+ | `SUFFIX` | Turkish morphological suffix | `lar`, `da`, `dı` |
143
+ | `FOREIGN` | Foreign/loanword root | `meeting`, `zoom`, `tweet` |
144
+ | `BPE` | Unknown subword (fallback) | rare/OOV fragments |
145
+ | `PUNCT` | Punctuation | `.`, `,`, `?` |
146
+ | `NUM` | Number | `3.5`, `%85` |
147
+ | `DATE` | Date | `14.03.2026` |
148
+ | `UNIT` | Measurement unit | `km`, `mg`, `TL` |
149
+ | `URL` | Web address | `https://...` |
150
+ | `MENTION` | @username | `@ethosoft` |
151
+ | `HASHTAG` | #topic | `#NLP` |
152
+ | `EMOJI` | Emoji | `😊` |
153
 
154
  ### Optional Metadata Fields
155
 
156
  | Field | Description |
157
  |---|---|
158
+ | `_canonical` | Canonical morpheme: `"lar"/"ler"` `"PL"`, `"dan"/"den"` `"ABL"` |
159
+ | `_suffix_label` | Detailed morphological label: `-PL+ACC`, `-P3+LOC`, … |
160
+ | `_foreign` | `True` foreign root detected by TDK lookup |
161
+ | `_caps` | `True` originally ALL CAPS word |
162
+ | `_domain` | `True` medical / sports / tourism domain word |
163
+ | `_compound` | `True` compound word (e.g. `başbakan`) |
164
+ | `_parts` | Compound parts: `["baş", "bakan"]` |
165
+ | `_expansion` | Acronym expansion: `"CMV"` → `"Sitomegalovirüs"` |
166
+ | `_pos` | POS tag from Zemberek: `Noun`, `Verb`, `Adj`, `Num`… |
167
+ | `_lemma` | Lemma from Zemberek: `"gelir"` → `"gelmek"` (when verb) |
168
+ | `_disambiguated` | `True` context disambiguation applied (`"yüz"`, `"gelir"`…) |
169
+ | `_root_corrected` | `True` phonetic root correction: `"gök"` `"göğüs"` |
 
 
170
 
171
+ ---
 
 
 
172
 
173
+ ## How It Works
174
 
175
+ TurkTokenizer wraps the base `turkish-tokenizer` BPE model with **12 sequential morphological fixes**:
 
 
 
 
176
 
177
+ | Fix | Problem | Solution |
178
+ |---|---|---|
179
+ | 1 | `İSTANBUL` → 16 BPE tokens | Lowercase before tokenization, restore `<uppercase_word>` marker |
180
+ | 2 | `meeting'e` → broken BPE | Detect foreign base + Turkish suffix, split at apostrophe |
181
+ | 3 | Turkish suffixes classified as BPE | 260+ suffix patterns reclassified → SUFFIX |
182
+ | 4 | Wrong roots (`gök` for `göğüs`) | Zemberek phonetic root validation & correction |
183
+ | 5 | Punctuation counted as BPE | Classify as PUNCT |
184
+ | 6 | Medical/domain terms as BPE | 500+ medical, sports, tourism root vocabulary |
185
+ | 7 | Foreign words as BPE | TDK 76K+ word lookup FOREIGN ROOT |
186
+ | 8 | Numbers, URLs, mentions fragmented | Pre-tokenization placeholder normalization |
187
+ | 9 | `lar`/`ler` different IDs for same morpheme | Allomorph canonicalization (`PL`, `ACC`, `DAT`…) |
188
+ | 10 | `başbakan` as single unknown ROOT | Compound word decomposition |
189
+ | 11 | `CMV`, `NATO` without meaning | Acronym expansion dictionary (100+ entries) |
190
+ | 12 | `yüz` = 100 or face or swim? | Zemberek sentence-level context disambiguation |
191
+
192
+ ---
193
 
194
  ## Benchmark
195
 
196
+ | Model | TR-MMLU |
197
  |---|---|
198
+ | GPT-4o | 78.3% |
199
+ | Llama-3-70B | 74.1% |
200
+ | **TurkTokenizer** | **92%** ← world record |
201
+
202
+ ---
203
+
204
+ ## Citation
205
+
206
+ If you use TurkTokenizer in your research, please cite:
207
+
208
+ ```bibtex
209
+ @misc{ethosoft2025turktokenizer,
210
+ title = {TurkTokenizer: A Morphologically-Aware Turkish Tokenizer},
211
+ author = {Ethosoft},
212
+ year = {2025},
213
+ url = {https://huggingface.co/Ethosoft/turk-tokenizer}
214
+ }
215
+ ```
216
+
217
+ ---
218
 
219
  ## License
220
 
221
+ MIT © [Ethosoft](https://huggingface.co/Ethosoft)