# Multrenizer

Multrenizer is a bilingual English-Turkish Unigram tokenizer built from scratch for Turkish morphology, Turkish-aware casing, and mixed TR-EN text.

## Links

- Repository: [github.com/fzengin19/multrenizer](https://github.com/fzengin19/multrenizer)

## Why Multrenizer?

Standard multilingual tokenizers routinely break Turkish at poor boundaries, waste context on agglutinative suffixes, and mishandle the Turkish dotted/dotless `I/i` rule. Multrenizer is designed to fix those failure modes without discarding punctuation and chat-critical symbols.

Core design goals:

- Turkish-aware normalization: hardcoded `İ -> i` and `I -> ı` before Unicode normalization (see the sketch after this list)
- Apostrophe preservation: forms like `feature'ı`, `merge'lemek`, `İstanbul'da`, and `can't` keep `'` as a real token
- Compact vocabulary budget: `~26K` target vocab for a Turkish-first bilingual tokenizer
- Fixed utility budget: dedicated punctuation, emoji, math, currency, and chat symbols
- Code-switching support: trained on mixed TR-EN text instead of treating it as noise
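
Why the hardcoded casing pass matters is easy to see in plain Python, where generic lowercasing applies English rules. A minimal sketch of the rule (illustration only; the shipped tokenizer implements it with `Replace` normalizers, see the Pipeline section):

```python
def turkish_lower(text: str) -> str:
    # Map the Turkish dotted/dotless capitals BEFORE the generic lowercase
    # pass; otherwise "I" wrongly becomes "i" and "İ" becomes "i" plus a
    # combining dot (U+0307).
    return text.replace("İ", "i").replace("I", "ı").lower()

print(turkish_lower("IŞIK"))   # ışık
print(turkish_lower("İNSAN"))  # insan
print("IŞIK".lower())          # işik  <- the generic behavior scored as FAIL below
```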

## Benchmark Results

Evaluated on `5,000` Turkish sentences, `5,000` English sentences, and `500` code-switching sentences from the prepared corpus against 5 reference tokenizers.

Notes:

- Multrenizer's shipped local artifact is loaded automatically from `multrenizer-tokenizer/tokenizer.json`; the current released artifact has `25,917` tokens.
- Example token strings for byte-level models are shown as raw tokenizer pieces. Metrics are based on exact token counts, not prettified decoding.

### Compared Tokenizers

| Tokenizer | Source | Vocab Size | Algorithm | Type |
|---|---|---:|---|---|
| **Multrenizer** | This project | **25,917** | Unigram | Bilingual EN-TR, purpose-built |
| **Kumru-2B** | [vngrs-ai/Kumru-2B](https://huggingface.co/vngrs-ai/Kumru-2B) | 50,176 | BPE | Turkish LLM (VNGRS, Sep 2025, Mistral-based) |
| **Turkcell-7B** | [TURKCELL/Turkcell-LLM-7b-v1](https://huggingface.co/TURKCELL/Turkcell-LLM-7b-v1) | 48,351 | BPE | Turkish LLM (Turkcell, Apr 2024, Mistral-based) |
| **GPT-2** | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | 50,257 | BPE | English-centric baseline (OpenAI, 2019) |
| **Qwen-3** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | 151,643 | BPE | Multilingual (Alibaba, 2025) |
| **Mistral-3.1** | [mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) | 131,072 | BPE/SP | Multilingual (Mistral AI, Mar 2025) |

### Fertility, Compression, and Token Count

Lower fertility means fewer tokens per word. Higher compression means more characters carried per token.
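
Both metrics are simple ratios over a corpus. A sketch of how they can be computed with the `tokenizers` API (the exact counting rules live in `benchmark.py` and may differ; this assumes whitespace-delimited words and no special tokens):

```python
from tokenizers import Tokenizer

def fertility_and_compression(tok: Tokenizer, sentences: list[str]) -> tuple[float, float]:
    # fertility   = total tokens / total whitespace-delimited words
    # compression = total characters / total tokens
    n_tokens = n_words = n_chars = 0
    for s in sentences:
        n_tokens += len(tok.encode(s, add_special_tokens=False).tokens)
        n_words += len(s.split())
        n_chars += len(s)
    return n_tokens / n_words, n_chars / n_tokens

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
print(fertility_and_compression(tok, ["İstanbul'da güzel bir gün"]))
```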

| Metric | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Vocab Size | **25,917** | 50,176 | 48,351 | 50,257 | 151,643 | 131,072 |
| **TR Fertility** | **1.627** | 1.649 | 1.917 | 3.785 | 2.616 | 2.384 |
| EN Fertility | 1.525 | 2.151 | 1.555 | **1.314** | 1.372 | 1.381 |
| **CS Fertility** | **1.756** | 1.923 | 1.832 | 3.475 | 2.445 | 2.479 |
| **TR Compression** | **4.783** | 4.719 | 4.060 | 2.056 | 2.976 | 3.265 |
| EN Compression | 4.148 | 2.942 | 4.068 | **4.816** | 4.610 | 4.580 |
| **TR Total Tokens (5K)** | **130,844** | 132,637 | 154,166 | 304,345 | 210,334 | 191,682 |
| EN Total Tokens (5K) | 157,027 | 221,420 | 160,121 | **135,235** | 141,275 | 142,196 |
| **CS Total Tokens (500)** | **5,525** | 6,050 | 5,762 | 10,933 | 7,693 | 7,799 |

Current position:

- Best Turkish efficiency in this comparison set: TR fertility, TR compression, TR total tokens
- Best code-switching efficiency in this comparison set: CS fertility and CS total tokens
- Competitive English coverage for a Turkish-first tokenizer, but not better than English-native GPT-2 on EN-only token count
- Only tokenizer here that passes Turkish `I/i` normalization correctly

### Morphological Splitting

Total tokens needed to represent 10 difficult Turkish words:

| Tokenizer | Vocab Size | Total Tokens | Avg per Word |
|---|---:|:---:|:---:|
| **Multrenizer** | **25,917** | **32** | **3.2** |
| Kumru-2B | 50,176 | 35 | 3.5 |
| Turkcell-7B | 48,351 | 38 | 3.8 |
| Mistral-3.1 | 131,072 | 71 | 7.1 |
| Qwen-3 | 151,643 | 73 | 7.3 |
| GPT-2 | 50,257 | 105 | 10.5 |

Selected examples:

```text
güzelleştirilmiş
Multrenizer:  güzel + leştirilmiş [2 tokens]
Kumru-2B:     2 tokens
Turkcell-7B:  güzel + leştirilmiş [2 tokens]
Qwen-3:       5 tokens
Mistral-3.1:  5 tokens
GPT-2:        10 tokens

İstanbul'da
Multrenizer:  istanbul + ' + da [3 tokens]
Kumru-2B:     3 tokens
Turkcell-7B:  İstanbul + ' + da [3 tokens]
Qwen-3:       4 tokens
Mistral-3.1:  4 tokens
GPT-2:        5 tokens

Afyonkarahisarlılaştıramadıklarımızdan
Multrenizer:  afyonkarahisar + lı + laştı + r + ama + dıkları + mızda + n [8 tokens]
Kumru-2B:     8 tokens
Turkcell-7B:  9 tokens
Qwen-3:       16 tokens
Mistral-3.1:  16 tokens
GPT-2:        21 tokens
```
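
Any of the Multrenizer rows can be reproduced directly from the shipped artifact; a quick check (expected output follows the splits listed above for the current release):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
enc = tok.encode("Afyonkarahisarlılaştıramadıklarımızdan", add_special_tokens=False)
print(len(enc.tokens), enc.tokens)
# 8 ['afyonkarahisar', 'lı', 'laştı', 'r', 'ama', 'dıkları', 'mızda', 'n']
```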

### Turkish I/i Normalization

This is the critical locale-sensitive test (the table shows five of the eight scored inputs):

- `İ` must lowercase to `i`
- `I` must lowercase to `ı`

| Input | Expected | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|---|:---:|:---:|:---:|:---:|:---:|:---:|
| İstanbul | istanbul | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| IŞIK | ışık | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| SIR | sır | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| İNSAN | insan | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| ISITMAK | ısıtmak | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| **Score** | | **8/8** | **0/8** | **0/8** | **0/8** | **0/8** | **0/8** |

Multrenizer is the only tokenizer in this comparison that handles Turkish casing correctly.
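
The check is easy to run against any `tokenizers`-based model by normalizing each input and comparing to the expected Turkish lowercase form. A sketch (the scoring harness in `benchmark.py` may differ):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
cases = {"İstanbul": "istanbul", "IŞIK": "ışık", "SIR": "sır",
         "İNSAN": "insan", "ISITMAK": "ısıtmak"}
for inp, expected in cases.items():
    got = tok.normalizer.normalize_str(inp)
    print(f"{inp:10} -> {got:10} {'OK' if got == expected else 'FAIL'}")
```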

### Code-Switching

```text
"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."

Multrenizer  [15 tok] bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unutmaya | lım | .
Kumru-2B     [20 tok] Bu | fe | ature | ' | ı | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalım | .
Turkcell-7B  [15 tok] Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .
GPT-2        [24 tok] Bu | feature | ' | ı | implement | ed | er | ken | edge | case | ' | ler | i | handle | et | me | yi | un | ut | may | al | ı | m | .
Qwen-3       [22 tok] Bu | feature | ' | ı | implement | ed | er | ken | edge | case | ' | leri | handle | et | m | ey | i | un | ut | may | alım | .
Mistral-3.1  [20 tok] Bu | feature | 'ı | implement | eder | ken | edge | case | ' | leri | handle | et | me | yi | un | ut | may | al | ım | .

"merge'lemek istediğim branch conflict veriyor."

Multrenizer  [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
Kumru-2B     [14 tok] mer | ge | ' | lemek | istediÄŁim | b | ran | ch | con | f | lic | t | veriyor | .
Turkcell-7B  [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
GPT-2        [16 tok] mer | ge | ' | lem | ek | is | ted | i | ÄŁ | im | branch | conflict | ver | iy | or | .
Qwen-3       [11 tok] merge | ' | lem | ek | istediÄŁ | im | branch | conflict | ver | iyor | .
Mistral-3.1  [13 tok] merge | ' | le | mek | ist | edi | ÄŁ | im | branch | conflict | ver | iyor | .
```
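
The reference rows can be reproduced from the corresponding Hub artifacts; a sketch using `transformers` (assumed available alongside the benchmark dependencies; byte-level tokenizers print raw pieces, including markers such as `Ġ` for a leading space and byte pairs like `ÄŁ` for `ğ`):

```python
from transformers import AutoTokenizer

ref = AutoTokenizer.from_pretrained("openai-community/gpt2")
s = "merge'lemek istediğim branch conflict veriyor."
pieces = ref.tokenize(s)
print(len(pieces), pieces)  # 16 raw pieces for GPT-2, matching the row above
```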

## Quick Start

### Installation

```bash
git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Use the shipped tokenizer

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']

print(tok.normalizer.normalize_str("IŞIK"))
# 'ışık'
```

### Train from scratch

```bash
# 1. Download and prepare corpus
python prepare_data.py --size medium

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push tokenizer files to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
    --repo-id your-username/multrenizer \
    --hf-token hf_xxxxx
```

### Run benchmarks

```bash
python benchmark.py --tr-lines 5000 --en-lines 5000
```

## Architecture

### Pipeline

```text
Raw text
-> Turkish I/i normalizer (Replace: İ->i, I->ı, i̇->i)
-> Quote canonicalization (’ ‘ ʼ ' -> ')
-> NFKC normalization
-> Lowercase
-> Strip whitespace
-> Pre-tokenizer (whitespace + apostrophe + punctuation split)
-> Unigram model (~26K target vocab)
-> Post-processor (<s> ... </s>)
```
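
The normalization stages map directly onto the `tokenizers` normalizer API. A minimal sketch (illustrative only; `train_tokenizer.py` is the source of truth, and the real quote canonicalization covers more variants than shown here):

```python
from tokenizers import normalizers
from tokenizers.normalizers import Replace, NFKC, Lowercase, Strip

normalizer = normalizers.Sequence([
    Replace("İ", "i"),   # Turkish dotted capital
    Replace("I", "ı"),   # Turkish dotless capital
    Replace("’", "'"),   # one of the canonicalized quote variants
    NFKC(),
    Lowercase(),
    Strip(),
])
print(normalizer.normalize_str("İstanbul’da IŞIK"))  # istanbul'da ışık
```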

### Data Mix

The released artifact is trained with the default file-based interleave in `train_tokenizer.py`, which approximates:

| Stream | Share | Purpose |
|---|---|---|
| Turkish | ~60% | Core Turkish morphology |
| English | ~30% | English coverage |
| Code-switching | ~10% | TR-EN boundary handling |

Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.
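
For intuition, a weighted interleave over line streams could look like the sketch below (hypothetical: `train_tokenizer.py`'s actual file-based scheme is the reference, and its sampling rules may differ):

```python
import random

def interleave(tr_lines, en_lines, cs_lines, weights=(0.6, 0.3, 0.1), seed=42):
    # Yield the next line from a stream chosen with probability ~60/30/10,
    # skipping streams that are exhausted.
    pools = [list(tr_lines), list(en_lines), list(cs_lines)]
    rng = random.Random(seed)
    while any(pools):
        live = [i for i, p in enumerate(pools) if p]
        i = rng.choices(live, weights=[weights[j] for j in live])[0]
        yield pools[i].pop(0)
```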

### Vocabulary Budget

Multrenizer is designed around a `26,000` target vocabulary, with a fixed budget reserved for always-preserved tokens:

- `32` named special tokens
- `512` reserved tokens
- `292` utility tokens
- up to `25,164` learned subword tokens

Current shipped artifact: `25,917` total tokens, i.e. the learned portion came in `83` tokens under its cap.
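
The budget sums to the target (32 + 512 + 292 + 25,164 = 26,000), and the shipped size is quick to verify:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
print(tok.get_vocab_size())  # 25917 for the current release
```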

### Special Tokens

| Category | IDs | Tokens | Purpose |
|---|---|---|---|
| **Core** | 0-3 | `<unk>` `<s>` `</s>` `<pad>` | Basic tokenizer operation |
| **Chat** | 4-8 | `<\|system\|>` `<\|user\|>` `<\|assistant\|>` `<\|end\|>` `<\|sep\|>` | Instruction tuning and chat models |
| **Reasoning** | 9-12 | `<think>` `</think>` `<\|step\|>` `<\|reflection\|>` | Reasoning traces and self-check markers |
| **Tool Use** | 13-16 | `<tool_call>` `</tool_call>` `<tool_response>` `</tool_response>` | Tool and function calling |
| **Code/FIM** | 17-20 | `<\|code\|>` `<\|fim_prefix\|>` `<\|fim_middle\|>` `<\|fim_suffix\|>` | Code and fill-in-middle workflows |
| **Bilingual** | 21-22 | `<\|tr\|>` `<\|en\|>` | Language tags |
| **RAG** | 23-24 | `<\|context\|>` `<\|/context\|>` | Retrieval boundaries |
| **Multi-modal** | 25-28 | `<\|image\|>` `<\|audio\|>` `<\|video\|>` `<\|file\|>` | Placeholder tokens |
| **Structured** | 29-31 | `<\|json\|>` `<\|table\|>` `<\|cite\|>` | Structured output markers |
| **Reserved** | 32-543 | `<\|reserved_0\|>` ... `<\|reserved_511\|>` | Future growth without retraining |
| **Utility** | 544+ | Punctuation, emoji, math, currency, typography | Critical text symbols kept intact |
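
Because the IDs are fixed, chat templates can rely on them; a quick lookup against the shipped artifact (expected values follow the table above, assuming the artifact matches it):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
for t in ("<s>", "<|user|>", "<think>", "<tool_call>", "<|tr|>"):
    print(t, tok.token_to_id(t))
# Expected per the table: <s> 1, <|user|> 5, <think> 9, <tool_call> 13, <|tr|> 21
```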

### Utility Tokens

| Category | Count | Examples |
|---|---:|---|
| Punctuation | 31 | `. , ! ? ; : - ( ) [ ] { } / \ " ' ...` |
| Currency & Business | 15 | `₺ $ € £ ¥ % @ # &` |
| Math & Science | 25 | `± × ÷ ≠ ≤ ≥ ∞ √ π α β γ` |
| Arrows & Symbols | 15 | `→ ← ↑ ↓ • ★ ☆ ✓ ✗ © ® ™` |
| Typography | 10 | `« » “ ” ‘ ’ ‹ › „ ‚` |
| Emoji (faces) | 70 | `😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖` |
| Emoji (hands) | 28 | `👋 👍 👎 👏 🙏 💪 ✊ ✌️` |
| Emoji (hearts) | 18 | `❤️ 💛 💚 💙 💜 🖤 💔` |
| Emoji (symbols) | 36 | `🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀` |
| Emoji (objects) | 36 | `💻 📱 🎯 🏆 📊 ☕ 🔗 💰` |
| Emoji (flags) | 8 | `🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇪🇸 🇮🇹 🇯🇵` |

## Project Structure

```text
multrenizer/
├── multrenizer-tokenizer/     # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py            # Corpus download and preparation
├── train_tokenizer.py         # Tokenizer training script
├── benchmark.py               # Benchmark against 5 reference tokenizers
├── benchmark_results.json     # Full benchmark output
├── tests/                     # Regression tests for tokenizer behavior
├── requirements.txt
└── pyproject.toml
```

## References

- [Tokens with Meaning: A Hybrid Tokenization Approach for Turkish](https://arxiv.org/html/2508.14292v2)
- [Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark](https://arxiv.org/html/2502.07057v1)
- [Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE](https://arxiv.org/abs/2508.08424)
- [Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration](https://blog.squeezebits.com/vocabulary-trimming-methods)

## License

Apache 2.0