| --- |
| license: apache-2.0 |
| language: |
| - tr |
| - en |
| library_name: transformers |
| tags: |
| - tokenizer |
| - tokenizers |
| - unigram |
| - turkish |
| - english |
| - bilingual |
| datasets: |
| - wikimedia/wikipedia |
| - Helsinki-NLP/opus-100 |
| --- |
| |
| # Multrenizer |
|
|
| Multrenizer is a bilingual English-Turkish Unigram tokenizer built from scratch for Turkish morphology, Turkish-aware casing, and mixed TR-EN text. |
|
|
| ## Links |
|
|
| - Repository: [github.com/fzengin19/multrenizer](https://github.com/fzengin19/multrenizer) |
| - Hugging Face: [huggingface.co/fzengin18/multrenizer](https://huggingface.co/fzengin18/multrenizer) |
|
|
| ## Why Multrenizer? |
|
|
| Standard multilingual tokenizers routinely break Turkish at poor boundaries, waste context on agglutinative suffixes, and mishandle the Turkish dotted/dotless `I/i` rule. Multrenizer is designed to fix those failure modes without discarding punctuation and chat-critical symbols. |
|
|
| Core design goals: |
|
|
| - Turkish-aware normalization: hardcoded `ฤฐ -> i` and `I -> ฤฑ` before Unicode normalization |
| - Apostrophe preservation: forms like `feature'ฤฑ`, `merge'lemek`, `ฤฐstanbul'da`, and `can't` keep `'` as a real token |
| - Compact vocabulary budget: `~26K` target vocab for a Turkish-first bilingual tokenizer |
| - Fixed utility budget: dedicated punctuation, emoji, math, currency, and chat symbols |
| - Code-switching support: trained on mixed TR-EN text instead of treating it as noise |
|
|
| ## Benchmark Results |
|
|
| Evaluated on `5,000` Turkish sentences, `5,000` English sentences, and `500` code-switching sentences from the prepared corpus against 5 reference tokenizers. |
|
|
| Notes: |
|
|
| - Multrenizer's shipped local artifact is auto-read from `multrenizer-tokenizer/tokenizer.json`; the current released artifact is `25,917` tokens. |
| - Example token strings for byte-level models are shown as raw tokenizer pieces. Metrics are based on exact token counts, not prettified decoding. |
|
|
| ### Compared Tokenizers |
|
|
| | Tokenizer | Source | Vocab Size | Algorithm | Type | |
| |---|---|---:|---|---| |
| | **Multrenizer** | This project | **25,917** | Unigram | Bilingual EN-TR, purpose-built | |
| | **Kumru-2B** | [vngrs-ai/Kumru-2B](https://huggingface.co/vngrs-ai/Kumru-2B) | 50,176 | BPE | Turkish LLM (VNGRS, Sep 2025, Mistral-based) | |
| | **Turkcell-7B** | [TURKCELL/Turkcell-LLM-7b-v1](https://huggingface.co/TURKCELL/Turkcell-LLM-7b-v1) | 48,351 | BPE | Turkish LLM (Turkcell, Apr 2024, Mistral-based) | |
| | **GPT-2** | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | 50,257 | BPE | English-centric baseline (OpenAI, 2019) | |
| | **Qwen-3** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | 151,643 | BPE | Multilingual (Alibaba, 2025) | |
| | **Mistral-3.1** | [mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) | 131,072 | BPE/SP | Multilingual (Mistral AI, Mar 2025) | |
|
|
| ### Fertility, Compression, and Token Count |
|
|
| Lower fertility means fewer tokens per word. Higher compression means more characters carried per token. |
|
|
| | Metric | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 | |
| |---|:---:|:---:|:---:|:---:|:---:|:---:| |
| | Vocab Size | **25,917** | 50,176 | 48,351 | 50,257 | 151,643 | 131,072 | |
| | **TR Fertility** | **1.627** | 1.649 | 1.917 | 3.785 | 2.616 | 2.384 | |
| | EN Fertility | 1.525 | 2.151 | 1.555 | **1.314** | 1.372 | 1.381 | |
| | **CS Fertility** | **1.756** | 1.923 | 1.832 | 3.475 | 2.445 | 2.479 | |
| | **TR Compression** | **4.783** | 4.719 | 4.060 | 2.056 | 2.976 | 3.265 | |
| | EN Compression | 4.148 | 2.942 | 4.068 | **4.816** | 4.610 | 4.580 | |
| | **TR Total Tokens (5K)** | **130,844** | 132,637 | 154,166 | 304,345 | 210,334 | 191,682 | |
| | EN Total Tokens (5K) | 157,027 | 221,420 | 160,121 | **135,235** | 141,275 | 142,196 | |
| | **CS Total Tokens (500)** | **5,525** | 6,050 | 5,762 | 10,933 | 7,693 | 7,799 | |
|
|
| Current position: |
|
|
| - Best Turkish efficiency in this comparison set: TR fertility, TR compression, TR total tokens |
| - Best code-switching efficiency in this comparison set: CS fertility and CS total tokens |
| - Competitive English coverage for a Turkish-first tokenizer, but not better than English-native GPT-2 on EN-only token count |
| - Only tokenizer here that passes Turkish `I/i` normalization correctly |
|
|
| ### Morphological Splitting |
|
|
| Total tokens needed to represent 10 difficult Turkish words: |
|
|
| | Tokenizer | Vocab Size | Total Tokens | Avg per Word | |
| |---|---:|:---:|:---:| |
| | **Multrenizer** | **25,917** | **32** | **3.2** | |
| | Kumru-2B | 50,176 | 35 | 3.5 | |
| | Turkcell-7B | 48,351 | 38 | 3.8 | |
| | Mistral-3.1 | 131,072 | 71 | 7.1 | |
| | Qwen-3 | 151,643 | 73 | 7.3 | |
| | GPT-2 | 50,257 | 105 | 10.5 | |
|
|
| Selected examples: |
|
|
| ```text |
| gรผzelleลtirilmiล |
| Multrenizer: gรผzel + leลtirilmiล [2 tokens] |
| Kumru-2B: gรยผzel + leร
ลtirilmiร
ล [2 tokens] |
| Turkcell-7B: gรผzel + leลtirilmiล [2 tokens] |
| Qwen-3: g + รยผz + elle + ร
ลtir + ilmiร
ล [5 tokens] |
| Mistral-3.1: g + รยผz + elle + ร
ลtir + ilmiร
ล [5 tokens] |
| GPT-2: g + รยผ + z + elle + ร
ล + t + ir + il + mi + ร
ล [10 tokens] |
| |
| ฤฐstanbul'da |
| Multrenizer: istanbul + ' + da [3 tokens] |
| Kumru-2B: รยฐstanbul + ' + da [3 tokens] |
| Turkcell-7B: ฤฐstanbul + ' + da [3 tokens] |
| Qwen-3: รยฐ + stanbul + 'd + a [4 tokens] |
| Mistral-3.1: รยฐ + stanbul + 'd + a [4 tokens] |
| GPT-2: ร + ยฐ + stanbul + 'd + a [5 tokens] |
| |
| Afyonkarahisarlฤฑlaลtฤฑramadฤฑklarฤฑmฤฑzdan |
| Multrenizer: afyonkarahisar + lฤฑ + laลtฤฑ + r + ama + dฤฑklarฤฑ + mฤฑzda + n [8 tokens] |
| Kumru-2B: Af + yonkarahisar + lรยฑ + laร
ลtรยฑr + ama + dรยฑk + larรยฑmรยฑz + dan [8 tokens] |
| Turkcell-7B: Afyon + kar + ah + is + arlฤฑ + laลtฤฑr + a + madฤฑk + larฤฑmฤฑzdan [9 tokens] |
| Qwen-3: Af + yon + kar + ah + is + ar + lรยฑ + la + ร
ลt + รยฑ + ram + ad + รยฑkl + ar + รยฑmรยฑz + dan [16 tokens] |
| Mistral-3.1: Af + yon + kar + ah + is + arl + รยฑ + laร
ลt + รยฑ + ram + ad + รยฑklarรยฑ + m + รยฑ + zd + an [16 tokens] |
| GPT-2: Af + yon + kar + ah + is + arl + รยฑ + la + ร
ล + t + รยฑ + ram + ad + รยฑ + k + lar + รยฑ + m + รยฑ + z + dan [21 tokens] |
| ``` |
|
|
| ### Turkish I/i Normalization |
|
|
| This is the critical locale-sensitive test: |
|
|
| - `ฤฐ` must lowercase to `i` |
| - `I` must lowercase to `ฤฑ` |
|
|
| | Input | Expected | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 | |
| |---|---|:---:|:---:|:---:|:---:|:---:|:---:| |
| | ฤฐstanbul | istanbul | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL | |
| | IลIK | ฤฑลฤฑk | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL | |
| | SIR | sฤฑr | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL | |
| | ฤฐNSAN | insan | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL | |
| | ISITMAK | ฤฑsฤฑtmak | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL | |
| | **Score** | | **8/8** | **0/8** | **0/8** | **0/8** | **0/8** | **0/8** | |
|
|
| Multrenizer is the only tokenizer in this comparison that handles Turkish casing correctly. |
|
|
| ### Code-Switching |
|
|
| ```text |
| "Bu feature'ฤฑ implement ederken edge case'leri handle etmeyi unutmayalฤฑm." |
| |
| Multrenizer [15 tok] bu | feature | ' | ฤฑ | implement | ederken | edge | case | ' | leri | handle | etmeyi | unutmaya | lฤฑm | . |
| Kumru-2B [20 tok] Bu | fe | ature | ' | รยฑ | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalรยฑm | . |
| Turkcell-7B [15 tok] Bu | feature | ' | ฤฑ | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalฤฑm | . |
| GPT-2 [24 tok] Bu | feature | ' | รยฑ | implement | ed | er | ken | edge | case | ' | ler | i | handle | et | me | yi | un | ut | may | al | รยฑ | m | . |
| Qwen-3 [22 tok] Bu | feature | ' | รยฑ | implement | ed | er | ken | edge | case | ' | leri | handle | et | m | ey | i | un | ut | may | alรยฑm | . |
| Mistral-3.1 [20 tok] Bu | feature | 'รยฑ | implement | eder | ken | edge | case | ' | leri | handle | et | me | yi | un | ut | may | al | รยฑm | . |
| |
| "merge'lemek istediฤim branch conflict veriyor." |
| |
| Multrenizer [ 8 tok] merge | ' | lemek | istediฤim | branch | conflict | veriyor | . |
| Kumru-2B [14 tok] mer | ge | ' | lemek | istediรลim | b | ran | ch | con | f | lic | t | veriyor | . |
| Turkcell-7B [ 8 tok] merge | ' | lemek | istediฤim | branch | conflict | veriyor | . |
| GPT-2 [16 tok] mer | ge | ' | lem | ek | is | ted | i | รล | im | branch | conflict | ver | iy | or | . |
| Qwen-3 [11 tok] merge | ' | lem | ek | istediรล | im | branch | conflict | ver | iyor | . |
| Mistral-3.1 [13 tok] merge | ' | le | mek | ist | edi | รล | im | branch | conflict | ver | iyor | . |
| ``` |
|
|
| ## Quick Start |
|
|
| ### Installation |
|
|
| ```bash |
| git clone https://github.com/fzengin19/multrenizer.git |
| cd multrenizer |
| python -m venv .venv |
| source .venv/bin/activate |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Use the shipped tokenizer locally |
|
|
| ```python |
| from tokenizers import Tokenizer |
| |
| tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json") |
| |
| encoded = tok.encode("ฤฐstanbul'da gรผzel bir gรผn") |
| print(encoded.tokens) |
| # ['<s>', 'istanbul', "'", 'da', 'gรผzel', 'bir', 'gรผn', '</s>'] |
| |
| print(tok.normalizer.normalize_str("IลIK")) |
| # 'ฤฑลฤฑk' |
| ``` |
|
|
| ### Load from Hugging Face |
|
|
| ```python |
| from tokenizers import Tokenizer |
| |
| tok = Tokenizer.from_pretrained("fzengin18/multrenizer") |
| |
| encoded = tok.encode("ฤฐstanbul'da gรผzel bir gรผn") |
| print(encoded.tokens) |
| # ['<s>', 'istanbul', "'", 'da', 'gรผzel', 'bir', 'gรผn', '</s>'] |
| ``` |
|
|
| If you use `transformers`, this also works: |
|
|
| ```python |
| from transformers import AutoTokenizer |
| |
| tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer") |
| print(tok.tokenize("ฤฐstanbul'da gรผzel bir gรผn")) |
| ``` |
|
|
| ### Train from scratch |
|
|
| ```bash |
| # 1. Download and prepare corpus |
| python prepare_data.py --size medium |
| |
| # 2. Train tokenizer |
| python train_tokenizer.py --data-dir data/ |
| |
| # 3. Optional: push tokenizer files to Hugging Face Hub |
| python train_tokenizer.py --data-dir data/ \ |
| --repo-id fzengin18/multrenizer \ |
| --hf-token "$HF_TOKEN" |
| ``` |
|
|
| ### Run benchmarks |
|
|
| ```bash |
| python benchmark.py --tr-lines 5000 --en-lines 5000 |
| ``` |
|
|
| ## Architecture |
|
|
| ### Pipeline |
|
|
| ```text |
| Raw text |
| -> Turkish I/i normalizer (Replace: ฤฐ->i, I->ฤฑ, iฬ->i) |
| -> Quote canonicalization (โ โ สผ ๏ผ -> ') |
| -> NFKC normalization |
| -> Lowercase |
| -> Strip whitespace |
| -> Pre-tokenizer (whitespace + apostrophe + punctuation split) |
| -> Unigram model (~26K target vocab) |
| -> Post-processor (<s> ... </s>) |
| ``` |
|
|
| ### Data Mix |
|
|
| The released artifact is trained with the default file-based interleave in `train_tokenizer.py`, which approximates: |
|
|
| | Stream | Share | Purpose | |
| |---|---|---| |
| | Turkish | ~60% | Core Turkish morphology | |
| | English | ~30% | English coverage | |
| | Code-switching | ~10% | TR-EN boundary handling | |
|
|
| Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation. |
|
|
| Exact source configs used during corpus preparation: |
|
|
| - `wikimedia/wikipedia` with `20231101.tr` |
| - `wikimedia/wikipedia` with `20231101.en` |
| - `Helsinki-NLP/opus-100` with `en-tr` |
|
|
| The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry. |
|
|
| ### Vocabulary Budget |
|
|
| Multrenizer is designed around a `26,000` target vocabulary, with a fixed budget reserved for always-preserved tokens: |
|
|
| - `32` named special tokens |
| - `512` reserved tokens |
| - `292` utility tokens |
| - up to `25,164` learned subword tokens |
|
|
| Current shipped artifact: `25,917` total tokens. |
|
|
| ### Special Tokens |
|
|
| | Category | IDs | Tokens | Purpose | |
| |---|---|---|---| |
| | **Core** | 0-3 | `<unk>` `<s>` `</s>` `<pad>` | Basic tokenizer operation | |
| | **Chat** | 4-8 | `<\|system\|>` `<\|user\|>` `<\|assistant\|>` `<\|end\|>` `<\|sep\|>` | Instruction tuning and chat models | |
| | **Reasoning** | 9-12 | `<think>` `</think>` `<\|step\|>` `<\|reflection\|>` | Reasoning traces and self-check markers | |
| | **Tool Use** | 13-16 | `<tool_call>` `</tool_call>` `<tool_response>` `</tool_response>` | Tool and function calling | |
| | **Code/FIM** | 17-20 | `<\|code\|>` `<\|fim_prefix\|>` `<\|fim_middle\|>` `<\|fim_suffix\|>` | Code and fill-in-middle workflows | |
| | **Bilingual** | 21-22 | `<\|tr\|>` `<\|en\|>` | Language tags | |
| | **RAG** | 23-24 | `<\|context\|>` `<\|/context\|>` | Retrieval boundaries | |
| | **Multi-modal** | 25-28 | `<\|image\|>` `<\|audio\|>` `<\|video\|>` `<\|file\|>` | Placeholder tokens | |
| | **Structured** | 29-31 | `<\|json\|>` `<\|table\|>` `<\|cite\|>` | Structured output markers | |
| | **Reserved** | 32-543 | `<\|reserved_0\|>` ... `<\|reserved_511\|>` | Future growth without retraining | |
| | **Utility** | 544+ | Punctuation, emoji, math, currency, typography | Critical text symbols kept intact | |
|
|
| ### Utility Tokens |
|
|
| | Category | Count | Examples | |
| |---|---:|---| |
| | Punctuation | 31 | `. , ! ? ; : - ( ) [ ] { } / \ " ' ...` | |
| | Currency & Business | 15 | `โบ $ โฌ ยฃ ยฅ % @ # &` | |
| | Math & Science | 25 | `ยฑ ร รท โ โค โฅ โ โ ฯ ฮฑ ฮฒ ฮณ` | |
| | Arrows & Symbols | 15 | `โ โ โ โ โข โ
โ โ โ ยฉ ยฎ โข` | |
| | Typography | 10 | `ยซ ยป โ โ โ โ โน โบ โ โ` | |
| | Emoji (faces) | 70 | `๐ ๐ ๐คฃ ๐ ๐ ๐ค ๐ญ ๐ก ๐ ๐ค` | |
| | Emoji (hands) | 28 | `๐ ๐ ๐ ๐ ๐ ๐ช โ โ๏ธ` | |
| | Emoji (hearts) | 18 | `โค๏ธ ๐ ๐ ๐ ๐ ๐ค ๐` | |
| | Emoji (symbols) | 36 | `๐ฅ โจ โญ โ
โ โ ๏ธ ๐ฏ ๐` | |
| | Emoji (objects) | 36 | `๐ป ๐ฑ ๐ฏ ๐ ๐ โ ๐ ๐ฐ` | |
| | Emoji (flags) | 8 | `๐น๐ท ๐บ๐ธ ๐ฌ๐ง ๐ฉ๐ช ๐ซ๐ท ๐ช๐ธ ๐ฎ๐น ๐ฏ๐ต` | |
|
|
| ## Project Structure |
|
|
| ```text |
| multrenizer/ |
| โโโ multrenizer-tokenizer/ # Trained tokenizer artifact |
| โ โโโ tokenizer.json |
| โ โโโ tokenizer_config.json |
| โ โโโ special_tokens_map.json |
| โโโ prepare_data.py # Corpus download and preparation |
| โโโ train_tokenizer.py # Tokenizer training script |
| โโโ benchmark.py # Benchmark against 5 reference tokenizers |
| โโโ benchmark_results.json # Full benchmark output |
| โโโ tests/ # Regression tests for tokenizer behavior |
| โโโ requirements.txt |
| โโโ pyproject.toml |
| ``` |
|
|
| ## References |
|
|
| - [Tokens with Meaning: A Hybrid Tokenization Approach for Turkish](https://arxiv.org/html/2508.14292v2) |
| - [Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark](https://arxiv.org/html/2502.07057v1) |
| - [Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE](https://arxiv.org/abs/2508.08424) |
| - [Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration](https://blog.squeezebits.com/vocabulary-trimming-methods) |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|