fzengin18 committed on
Commit d11ea56 · verified · 1 Parent(s): deb7091

Update model card

Files changed (1): README.md (+290 −3)

# Multrenizer

Multrenizer is a bilingual English-Turkish Unigram tokenizer built from scratch for Turkish morphology, Turkish-aware casing, and mixed TR-EN text.

## Links

- Repository: [github.com/fzengin19/multrenizer](https://github.com/fzengin19/multrenizer)

## Why Multrenizer?

Standard multilingual tokenizers routinely break Turkish at poor boundaries, waste context on agglutinative suffixes, and mishandle the Turkish dotted/dotless `I/i` rule. Multrenizer is designed to fix those failure modes without discarding punctuation and chat-critical symbols.

Core design goals:

- Turkish-aware normalization: hardcoded `İ -> i` and `I -> ı` before Unicode normalization
- Apostrophe preservation: forms like `feature'ı`, `merge'lemek`, `İstanbul'da`, and `can't` keep `'` as a real token
- Compact vocabulary budget: `~26K` target vocab for a Turkish-first bilingual tokenizer
- Fixed utility budget: dedicated punctuation, emoji, math, currency, and chat symbols
- Code-switching support: trained on mixed TR-EN text instead of treating it as noise

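The first design goal exists because locale-independent Unicode casing gets Turkish wrong. A standalone Python demonstration (not project code) of the failure the hardcoded mapping prevents:

```python
# Default Unicode lowercasing ignores the Turkish dotted/dotless rule:
# 'I' should become 'ı', and 'İ' should become plain 'i'.
print("ISITMAK".lower())        # 'isitmak' -> wrong, Turkish expects 'ısıtmak'
print(len("İstanbul".lower()))  # 9 -> 'İ' lowercases to 'i' plus U+0307 combining dot
```

Replacing `İ -> i` and `I -> ı` before any generic lowercasing, as the first goal states, sidesteps both failures.
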
## Benchmark Results

Evaluated on `5,000` Turkish sentences, `5,000` English sentences, and `500` code-switching sentences from the prepared corpus against 5 reference tokenizers.

Notes:

- Multrenizer's shipped local artifact is auto-read from `multrenizer-tokenizer/tokenizer.json`; the current released artifact is `25,917` tokens.
- Example token strings for byte-level models are shown as raw tokenizer pieces. Metrics are based on exact token counts, not prettified decoding.

### Compared Tokenizers

| Tokenizer | Source | Vocab Size | Algorithm | Type |
|---|---|---:|---|---|
| **Multrenizer** | This project | **25,917** | Unigram | Bilingual EN-TR, purpose-built |
| **Kumru-2B** | [vngrs-ai/Kumru-2B](https://huggingface.co/vngrs-ai/Kumru-2B) | 50,176 | BPE | Turkish LLM (VNGRS, Sep 2025, Mistral-based) |
| **Turkcell-7B** | [TURKCELL/Turkcell-LLM-7b-v1](https://huggingface.co/TURKCELL/Turkcell-LLM-7b-v1) | 48,351 | BPE | Turkish LLM (Turkcell, Apr 2024, Mistral-based) |
| **GPT-2** | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | 50,257 | BPE | English-centric baseline (OpenAI, 2019) |
| **Qwen-3** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | 151,643 | BPE | Multilingual (Alibaba, 2025) |
| **Mistral-3.1** | [mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) | 131,072 | BPE/SP | Multilingual (Mistral AI, Mar 2025) |

### Fertility, Compression, and Token Count

Lower fertility means fewer tokens per word. Higher compression means more characters carried per token.

| Metric | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Vocab Size | **25,917** | 50,176 | 48,351 | 50,257 | 151,643 | 131,072 |
| **TR Fertility** | **1.627** | 1.649 | 1.917 | 3.785 | 2.616 | 2.384 |
| EN Fertility | 1.525 | 2.151 | 1.555 | **1.314** | 1.372 | 1.381 |
| **CS Fertility** | **1.756** | 1.923 | 1.832 | 3.475 | 2.445 | 2.479 |
| **TR Compression** | **4.783** | 4.719 | 4.060 | 2.056 | 2.976 | 3.265 |
| EN Compression | 4.148 | 2.942 | 4.068 | **4.816** | 4.610 | 4.580 |
| **TR Total Tokens (5K)** | **130,844** | 132,637 | 154,166 | 304,345 | 210,334 | 191,682 |
| EN Total Tokens (5K) | 157,027 | 221,420 | 160,121 | **135,235** | 141,275 | 142,196 |
| **CS Total Tokens (500)** | **5,525** | 6,050 | 5,762 | 10,933 | 7,693 | 7,799 |

Current position:

- Best Turkish efficiency in this comparison set: TR fertility, TR compression, and TR total tokens
- Best code-switching efficiency in this comparison set: CS fertility and CS total tokens
- Competitive English coverage for a Turkish-first tokenizer, though not better than the English-native GPT-2 on EN-only token count
- The only tokenizer here that passes the Turkish `I/i` normalization tests

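The two efficiency metrics in the table reduce to simple ratios. A sketch of the definitions (illustrative only; the exact word and character counting rules live in `benchmark.py` and may differ in detail):

```python
def fertility(num_tokens: int, num_words: int) -> float:
    # Average tokens emitted per word; lower is better.
    return num_tokens / num_words

def compression(num_chars: int, num_tokens: int) -> float:
    # Average characters carried per token; higher is better.
    return num_chars / num_tokens

# Toy round numbers, not the benchmark corpus:
print(fertility(1627, 1000))    # 1.627
print(compression(4783, 1000))  # 4.783
```
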
### Morphological Splitting

Total tokens needed to represent 10 difficult Turkish words:

| Tokenizer | Vocab Size | Total Tokens | Avg per Word |
|---|---:|:---:|:---:|
| **Multrenizer** | **25,917** | **32** | **3.2** |
| Kumru-2B | 50,176 | 35 | 3.5 |
| Turkcell-7B | 48,351 | 38 | 3.8 |
| Mistral-3.1 | 131,072 | 71 | 7.1 |
| Qwen-3 | 151,643 | 73 | 7.3 |
| GPT-2 | 50,257 | 105 | 10.5 |

Selected examples:

```text
güzelleştirilmiş
  Multrenizer: güzel + leştirilmiş [2 tokens]
  Kumru-2B: 2 tokens
  Turkcell-7B: güzel + leştirilmiş [2 tokens]
  Qwen-3: 5 tokens
  Mistral-3.1: 5 tokens
  GPT-2: 10 tokens

İstanbul'da
  Multrenizer: istanbul + ' + da [3 tokens]
  Kumru-2B: 3 tokens
  Turkcell-7B: İstanbul + ' + da [3 tokens]
  Qwen-3: 4 tokens
  Mistral-3.1: 4 tokens
  GPT-2: 5 tokens

Afyonkarahisarlılaştıramadıklarımızdan
  Multrenizer: afyonkarahisar + lı + laştı + r + ama + dıkları + mızda + n [8 tokens]
  Kumru-2B: 8 tokens
  Turkcell-7B: 9 tokens
  Qwen-3: 16 tokens
  Mistral-3.1: 16 tokens
  GPT-2: 21 tokens
```

### Turkish I/i Normalization

This is the critical locale-sensitive test (the table shows 5 of the 8 scored cases):

- `İ` must lowercase to `i`
- `I` must lowercase to `ı`

| Input | Expected | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|---|:---:|:---:|:---:|:---:|:---:|:---:|
| İstanbul | istanbul | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| IŞIK | ışık | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| SIR | sır | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| İNSAN | insan | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| ISITMAK | ısıtmak | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| **Score** | | **8/8** | **0/8** | **0/8** | **0/8** | **0/8** | **0/8** |

Multrenizer is the only tokenizer in this comparison that handles Turkish casing correctly.

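The expected column can be reproduced with the two hardcoded replacements described earlier. A minimal pure-Python sketch of the rule (the shipped normalizer is configured inside `tokenizer.json`; this is only an illustration):

```python
def tr_lower(text: str) -> str:
    # Apply the Turkish I/i rule before generic lowercasing:
    # dotted İ -> i, dotless I -> ı.
    return text.replace("İ", "i").replace("I", "ı").lower()

cases = {
    "İstanbul": "istanbul",
    "IŞIK": "ışık",
    "SIR": "sır",
    "İNSAN": "insan",
    "ISITMAK": "ısıtmak",
}
for word, expected in cases.items():
    assert tr_lower(word) == expected
print("all five table cases pass")
```
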
### Code-Switching

```text
"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."

Multrenizer  [15 tok] bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unutmaya | lım | .
Kumru-2B     [20 tok] Bu | fe | ature | ' | ı | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalım | .
Turkcell-7B  [15 tok] Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .
GPT-2        [24 tok] Bu | feature | ' | ı | implement | ed | er | ken | edge | case | ' | ler | i | handle | et | me | yi | un | ut | may | al | ı | m | .
Qwen-3       [22 tok] Bu | feature | ' | ı | implement | ed | er | ken | edge | case | ' | leri | handle | et | m | ey | i | un | ut | may | alım | .
Mistral-3.1  [20 tok] Bu | feature | 'ı | implement | eder | ken | edge | case | ' | leri | handle | et | me | yi | un | ut | may | al | ım | .

"merge'lemek istediğim branch conflict veriyor."

Multrenizer  [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
Kumru-2B     [14 tok] mer | ge | ' | lemek | istediÄŁim | b | ran | ch | con | f | lic | t | veriyor | .
Turkcell-7B  [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
GPT-2        [16 tok] mer | ge | ' | lem | ek | is | ted | i | ÄŁ | im | branch | conflict | ver | iy | or | .
Qwen-3       [11 tok] merge | ' | lem | ek | istediÄŁ | im | branch | conflict | ver | iyor | .
Mistral-3.1  [13 tok] merge | ' | le | mek | ist | edi | ÄŁ | im | branch | conflict | ver | iyor | .
```

## Quick Start

### Installation

```bash
git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Use the shipped tokenizer

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']

print(tok.normalizer.normalize_str("IŞIK"))
# 'ışık'
```

### Train from scratch

```bash
# 1. Download and prepare corpus
python prepare_data.py --size medium

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push tokenizer files to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
  --repo-id your-username/multrenizer \
  --hf-token hf_xxxxx
```

### Run benchmarks

```bash
python benchmark.py --tr-lines 5000 --en-lines 5000
```

## Architecture

### Pipeline

```text
Raw text
-> Turkish I/i normalizer (Replace: İ->i, I->ı, i̇->i)
-> Quote canonicalization (’ ‘ ʼ ' -> ')
-> NFKC normalization
-> Lowercase
-> Strip whitespace
-> Pre-tokenizer (whitespace + apostrophe + punctuation split)
-> Unigram model (~26K target vocab)
-> Post-processor (<s> ... </s>)
```

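The pre-tokenizer stage can be approximated with one regular expression (an illustrative sketch; the shipped pre-tokenizer is serialized in `tokenizer.json` and may treat edge cases differently):

```python
import re

def pretokenize(text: str) -> list[str]:
    # Split on whitespace and emit every punctuation mark, including
    # the apostrophe, as its own piece, as the pipeline above describes.
    return re.findall(r"\w+|[^\w\s]", text)

print(pretokenize("istanbul'da güzel bir gün."))
# ['istanbul', "'", 'da', 'güzel', 'bir', 'gün', '.']
```
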
### Data Mix

The released artifact is trained with the default file-based interleave in `train_tokenizer.py`, which approximates:

| Stream | Share | Purpose |
|---|---:|---|
| Turkish | ~60% | Core Turkish morphology |
| English | ~30% | English coverage |
| Code-switching | ~10% | TR-EN boundary handling |

Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.

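The target ratios can be pictured as a repeating 10-slot pattern over the three streams (a deterministic sketch of the mix only; the actual file-based interleave in `train_tokenizer.py` may schedule lines differently):

```python
from itertools import cycle, islice

def interleave(tr, en, cs, n):
    # A 10-slot pattern yields 60% Turkish, 30% English, 10% code-switching.
    pattern = ["tr"] * 6 + ["en"] * 3 + ["cs"]
    streams = {"tr": cycle(tr), "en": cycle(en), "cs": cycle(cs)}
    return [next(streams[k]) for k in islice(cycle(pattern), n)]

mix = interleave(["tr_line"], ["en_line"], ["cs_line"], 10)
print(mix.count("tr_line"), mix.count("en_line"), mix.count("cs_line"))  # 6 3 1
```
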
### Vocabulary Budget

Multrenizer is designed around a `26,000`-token target vocabulary, with a fixed budget reserved for always-preserved tokens:

- `32` named special tokens
- `512` reserved tokens
- `292` utility tokens
- up to `25,164` learned subword tokens

Current shipped artifact: `25,917` total tokens.

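The learned-subword cap follows directly from the fixed budget; a worked check of the arithmetic above:

```python
SPECIAL, RESERVED, UTILITY = 32, 512, 292
TARGET_VOCAB = 26_000

learned_cap = TARGET_VOCAB - (SPECIAL + RESERVED + UTILITY)
print(learned_cap)             # 25164 learned subword slots at most
print(25_917 <= TARGET_VOCAB)  # True: the shipped artifact stays under target
```
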
### Special Tokens

| Category | IDs | Tokens | Purpose |
|---|---|---|---|
| **Core** | 0-3 | `<unk>` `<s>` `</s>` `<pad>` | Basic tokenizer operation |
| **Chat** | 4-8 | `<\|system\|>` `<\|user\|>` `<\|assistant\|>` `<\|end\|>` `<\|sep\|>` | Instruction tuning and chat models |
| **Reasoning** | 9-12 | `<think>` `</think>` `<\|step\|>` `<\|reflection\|>` | Reasoning traces and self-check markers |
| **Tool Use** | 13-16 | `<tool_call>` `</tool_call>` `<tool_response>` `</tool_response>` | Tool and function calling |
| **Code/FIM** | 17-20 | `<\|code\|>` `<\|fim_prefix\|>` `<\|fim_middle\|>` `<\|fim_suffix\|>` | Code and fill-in-middle workflows |
| **Bilingual** | 21-22 | `<\|tr\|>` `<\|en\|>` | Language tags |
| **RAG** | 23-24 | `<\|context\|>` `<\|/context\|>` | Retrieval boundaries |
| **Multi-modal** | 25-28 | `<\|image\|>` `<\|audio\|>` `<\|video\|>` `<\|file\|>` | Placeholder tokens |
| **Structured** | 29-31 | `<\|json\|>` `<\|table\|>` `<\|cite\|>` | Structured output markers |
| **Reserved** | 32-543 | `<\|reserved_0\|>` ... `<\|reserved_511\|>` | Future growth without retraining |
| **Utility** | 544+ | Punctuation, emoji, math, currency, typography | Critical text symbols kept intact |

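As an illustration of how the chat tokens compose, a prompt could be assembled as below. This is purely hypothetical: the card defines the tokens, not a chat template, and `format_chat` is an invented helper:

```python
def format_chat(system: str, user: str) -> str:
    # Hypothetical prompt assembly using the card's chat special tokens.
    return (
        f"<|system|>{system}<|end|>"
        f"<|user|>{user}<|end|>"
        f"<|assistant|>"
    )

prompt = format_chat("Sen yardımcı bir asistansın.", "Merhaba!")
print(prompt)
```
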
### Utility Tokens

| Category | Count | Examples |
|---|---:|---|
| Punctuation | 31 | `. , ! ? ; : - ( ) [ ] { } / \ " ' ...` |
| Currency & Business | 15 | `₺ $ € £ ¥ % @ # &` |
| Math & Science | 25 | `± × ÷ ≠ ≤ ≥ ∞ √ π α β γ` |
| Arrows & Symbols | 15 | `→ ← ↑ ↓ • ★ ☆ ✓ ✗ © ® ™` |
| Typography | 10 | `« » “ ” ‘ ’ ‹ › „ ‚` |
| Emoji (faces) | 70 | `😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖` |
| Emoji (hands) | 28 | `👋 👍 👎 👏 🙏 💪 ✊ ✌️` |
| Emoji (hearts) | 18 | `❤️ 💛 💚 💙 💜 🖤 💔` |
| Emoji (symbols) | 36 | `🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀` |
| Emoji (objects) | 36 | `💻 📱 🎯 🏆 📊 ☕ 🔗 💰` |
| Emoji (flags) | 8 | `🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇪🇸 🇮🇹 🇯🇵` |

## Project Structure

```text
multrenizer/
├── multrenizer-tokenizer/   # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py          # Corpus download and preparation
├── train_tokenizer.py       # Tokenizer training script
├── benchmark.py             # Benchmark against 5 reference tokenizers
├── benchmark_results.json   # Full benchmark output
├── tests/                   # Regression tests for tokenizer behavior
├── requirements.txt
└── pyproject.toml
```

## References

- [Tokens with Meaning: A Hybrid Tokenization Approach for Turkish](https://arxiv.org/html/2508.14292v2)
- [Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark](https://arxiv.org/html/2502.07057v1)
- [Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE](https://arxiv.org/abs/2508.08424)
- [Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration](https://blog.squeezebits.com/vocabulary-trimming-methods)

## License

Apache 2.0