---
license: apache-2.0
language:
- tr
- en
library_name: transformers
tags:
- tokenizer
- tokenizers
- unigram
- turkish
- english
- bilingual
datasets:
- wikimedia/wikipedia
- Helsinki-NLP/opus-100
---
# Multrenizer
Multrenizer is a bilingual English-Turkish Unigram tokenizer built from scratch for Turkish morphology, Turkish-aware casing, and mixed TR-EN text.
## Links
- Repository: [github.com/fzengin19/multrenizer](https://github.com/fzengin19/multrenizer)
- Hugging Face: [huggingface.co/fzengin18/multrenizer](https://huggingface.co/fzengin18/multrenizer)
## Why Multrenizer?
Standard multilingual tokenizers routinely break Turkish at poor boundaries, waste context on agglutinative suffixes, and mishandle the Turkish dotted/dotless `I/i` rule. Multrenizer is designed to fix those failure modes without discarding punctuation and chat-critical symbols.
Core design goals:
- Turkish-aware normalization: hardcoded `İ -> i` and `I -> ı` before Unicode normalization
- Apostrophe preservation: forms like `feature'ı`, `merge'lemek`, `İstanbul'da`, and `can't` keep `'` as a real token
- Compact vocabulary budget: `~26K` target vocab for a Turkish-first bilingual tokenizer
- Fixed utility budget: dedicated punctuation, emoji, math, currency, and chat symbols
- Code-switching support: trained on mixed TR-EN text instead of treating it as noise
## Benchmark Results
Evaluated on `5,000` Turkish sentences, `5,000` English sentences, and `500` code-switching sentences from the prepared corpus against 5 reference tokenizers.
Notes:
- Multrenizer's shipped local artifact is auto-read from `multrenizer-tokenizer/tokenizer.json`; the current released artifact is `25,917` tokens.
- Example token strings for byte-level models are shown as raw tokenizer pieces. Metrics are based on exact token counts, not prettified decoding.
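The raw pieces mentioned in the note look odd because byte-level tokenizers store each UTF-8 byte as a printable stand-in character. The sketch below assumes the GPT-2-style byte-to-unicode table (other byte-level tokenizers in this comparison may use a different mapping) and shows how a single Turkish letter turns into two characters. The helper names are illustrative and not part of this repository.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style table: every byte value gets a printable character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    table = {b: chr(b) for b in keep}
    # Remaining bytes are shifted to code points 256 and up, in byte order.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


def raw_piece(text: str) -> str:
    """Render text the way a byte-level vocabulary stores it."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


for ch in ["ü", "ı", "İ", "ş", "ğ"]:
    print(ch, "->", raw_piece(ch))
```

Each of these letters occupies two bytes in UTF-8, so a byte-level vocabulary needs either a learned merge or two separate pieces to represent it.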
### Compared Tokenizers
| Tokenizer | Source | Vocab Size | Algorithm | Type |
|---|---|---:|---|---|
| **Multrenizer** | This project | **25,917** | Unigram | Bilingual EN-TR, purpose-built |
| **Kumru-2B** | [vngrs-ai/Kumru-2B](https://huggingface.co/vngrs-ai/Kumru-2B) | 50,176 | BPE | Turkish LLM (VNGRS, Sep 2025, Mistral-based) |
| **Turkcell-7B** | [TURKCELL/Turkcell-LLM-7b-v1](https://huggingface.co/TURKCELL/Turkcell-LLM-7b-v1) | 48,351 | BPE | Turkish LLM (Turkcell, Apr 2024, Mistral-based) |
| **GPT-2** | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) | 50,257 | BPE | English-centric baseline (OpenAI, 2019) |
| **Qwen-3** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | 151,643 | BPE | Multilingual (Alibaba, 2025) |
| **Mistral-3.1** | [mistralai/Mistral-Small-3.1-24B-Base-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503) | 131,072 | BPE/SP | Multilingual (Mistral AI, Mar 2025) |
### Fertility, Compression, and Token Count
Lower fertility means fewer tokens per word. Higher compression means more characters carried per token.
| Metric | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Vocab Size | **25,917** | 50,176 | 48,351 | 50,257 | 151,643 | 131,072 |
| **TR Fertility** | **1.627** | 1.649 | 1.917 | 3.785 | 2.616 | 2.384 |
| EN Fertility | 1.525 | 2.151 | 1.555 | **1.314** | 1.372 | 1.381 |
| **CS Fertility** | **1.756** | 1.923 | 1.832 | 3.475 | 2.445 | 2.479 |
| **TR Compression** | **4.783** | 4.719 | 4.060 | 2.056 | 2.976 | 3.265 |
| EN Compression | 4.148 | 2.942 | 4.068 | **4.816** | 4.610 | 4.580 |
| **TR Total Tokens (5K)** | **130,844** | 132,637 | 154,166 | 304,345 | 210,334 | 191,682 |
| EN Total Tokens (5K) | 157,027 | 221,420 | 160,121 | **135,235** | 141,275 | 142,196 |
| **CS Total Tokens (500)** | **5,525** | 6,050 | 5,762 | 10,933 | 7,693 | 7,799 |
Current position:
- Best Turkish efficiency in this comparison set: TR fertility, TR compression, TR total tokens
- Best code-switching efficiency in this comparison set: CS fertility and CS total tokens
- Competitive English coverage for a Turkish-first tokenizer, but not better than English-native GPT-2 on EN-only token count
- Only tokenizer here that passes Turkish `I/i` normalization correctly
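The two headline metrics can be sketched as follows. This is an illustration under stated assumptions, not the code in `benchmark.py`: words are taken to be whitespace-separated, characters are counted with spaces included, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, List, Tuple


def fertility_and_compression(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[float, float]:
    """Return (tokens per word, characters per token) over a corpus."""
    total_tokens = total_words = total_chars = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())  # assumption: whitespace words
        total_chars += len(sentence)          # assumption: spaces included
    return total_tokens / total_words, total_chars / total_tokens


def toy_tokenize(sentence: str) -> List[str]:
    """Hypothetical tokenizer: cut every word into chunks of up to 4 chars."""
    pieces: List[str] = []
    for word in sentence.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


sample = ["güzel bir gün", "branch conflict veriyor"]
fertility, compression = fertility_and_compression(sample, toy_tokenize)
print(f"fertility={fertility:.3f} compression={compression:.3f}")
```

To try it with a real tokenizer, `tokenize` could be something like `lambda s: tok.encode(s, add_special_tokens=False).tokens`; whether the published numbers exclude `<s>` and `</s>` is not stated here.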
### Morphological Splitting
Total tokens needed to represent 10 difficult Turkish words:
| Tokenizer | Vocab Size | Total Tokens | Avg per Word |
|---|---:|:---:|:---:|
| **Multrenizer** | **25,917** | **32** | **3.2** |
| Kumru-2B | 50,176 | 35 | 3.5 |
| Turkcell-7B | 48,351 | 38 | 3.8 |
| Mistral-3.1 | 131,072 | 71 | 7.1 |
| Qwen-3 | 151,643 | 73 | 7.3 |
| GPT-2 | 50,257 | 105 | 10.5 |
Selected examples:
```text
güzelleştirilmiş
  Multrenizer: güzel + leştirilmiş [2 tokens]
  Kumru-2B: gÃ¼zel + leÅŁtirilmiÅŁ [2 tokens]
  Turkcell-7B: güzel + leştirilmiş [2 tokens]
  Qwen-3: g + Ã¼z + elle + ÅŁtir + ilmiÅŁ [5 tokens]
  Mistral-3.1: g + Ã¼z + elle + ÅŁtir + ilmiÅŁ [5 tokens]
  GPT-2: g + Ã¼ + z + elle + ÅŁ + t + ir + il + mi + ÅŁ [10 tokens]

İstanbul'da
  Multrenizer: istanbul + ' + da [3 tokens]
  Kumru-2B: Ä°stanbul + ' + da [3 tokens]
  Turkcell-7B: İstanbul + ' + da [3 tokens]
  Qwen-3: Ä° + stanbul + 'd + a [4 tokens]
  Mistral-3.1: Ä° + stanbul + 'd + a [4 tokens]
  GPT-2: Ä + ° + stanbul + 'd + a [5 tokens]

Afyonkarahisarlılaştıramadıklarımızdan
  Multrenizer: afyonkarahisar + lı + laştı + r + ama + dıkları + mızda + n [8 tokens]
  Kumru-2B: Af + yonkarahisar + lÄ± + laÅŁtÄ±r + ama + dÄ±k + larÄ±mÄ±z + dan [8 tokens]
  Turkcell-7B: Afyon + kar + ah + is + arlı + laştır + a + madık + larımızdan [9 tokens]
  Qwen-3: Af + yon + kar + ah + is + ar + lÄ± + la + ÅŁt + Ä± + ram + ad + Ä±kl + ar + Ä±mÄ±z + dan [16 tokens]
  Mistral-3.1: Af + yon + kar + ah + is + arl + Ä± + laÅŁt + Ä± + ram + ad + Ä±klarÄ± + m + Ä± + zd + an [16 tokens]
  GPT-2: Af + yon + kar + ah + is + arl + Ä± + la + ÅŁ + t + Ä± + ram + ad + Ä± + k + lar + Ä± + m + Ä± + z + dan [21 tokens]
```
### Turkish I/i Normalization
This is the critical locale-sensitive test:
- `İ` must lowercase to `i`
- `I` must lowercase to `ı`
| Input | Expected | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|---|:---:|:---:|:---:|:---:|:---:|:---:|
| İstanbul | istanbul | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| IŞIK | ışık | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| SIR | sır | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| İNSAN | insan | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| ISITMAK | ısıtmak | **OK** | FAIL | FAIL | FAIL | FAIL | FAIL |
| **Score** | | **8/8** | **0/8** | **0/8** | **0/8** | **0/8** | **0/8** |
Multrenizer is the only tokenizer in this comparison that handles Turkish casing correctly.
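The reason generic lowercasing fails this test can be reproduced with plain Python, whose `str.lower()` is locale-independent. The sketch below is a minimal illustration of the rule, not the tokenizer's actual normalizer; `turkish_lower` is a hypothetical helper name.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish dotted/dotless I rule applied first."""
    # Map the two locale-sensitive capitals by hand, then lowercase the rest.
    text = text.replace("İ", "i").replace("I", "ı")
    return text.lower()


for word in ["İstanbul", "IŞIK", "SIR", "İNSAN", "ISITMAK"]:
    # repr() makes the stray combining dot from the naive path visible.
    print(f"{word}: naive={word.lower()!r} turkish={turkish_lower(word)!r}")
```

The naive path turns `I` into dotted `i` and turns `İ` into `i` followed by a combining dot above (U+0307), so neither matches the expected Turkish form. Note that this rule also maps the `I` in English words to `ı`, which is the trade-off of hardcoding it.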
### Code-Switching
```text
"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."
Multrenizer [15 tok] bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unutmaya | lım | .
Kumru-2B [20 tok] Bu | fe | ature | ' | Ä± | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalÄ±m | .
Turkcell-7B [15 tok] Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .
GPT-2 [24 tok] Bu | feature | ' | Ä± | implement | ed | er | ken | edge | case | ' | ler | i | handle | et | me | yi | un | ut | may | al | Ä± | m | .
Qwen-3 [22 tok] Bu | feature | ' | Ä± | implement | ed | er | ken | edge | case | ' | leri | handle | et | m | ey | i | un | ut | may | alÄ±m | .
Mistral-3.1 [20 tok] Bu | feature | 'Ä± | implement | eder | ken | edge | case | ' | leri | handle | et | me | yi | un | ut | may | al | Ä±m | .

"merge'lemek istediğim branch conflict veriyor."
Multrenizer [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
Kumru-2B [14 tok] mer | ge | ' | lemek | istediÄŁim | b | ran | ch | con | f | lic | t | veriyor | .
Turkcell-7B [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
GPT-2 [16 tok] mer | ge | ' | lem | ek | is | ted | i | ÄŁ | im | branch | conflict | ver | iy | or | .
Qwen-3 [11 tok] merge | ' | lem | ek | istediÄŁ | im | branch | conflict | ver | iyor | .
Mistral-3.1 [13 tok] merge | ' | le | mek | ist | edi | ÄŁ | im | branch | conflict | ver | iyor | .
```
## Quick Start
### Installation
```bash
git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Use the shipped tokenizer locally
```python
from tokenizers import Tokenizer
tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")
encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']
print(tok.normalizer.normalize_str("IŞIK"))
# 'ışık'
```
### Load from Hugging Face
```python
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("fzengin18/multrenizer")
encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']
```
If you use `transformers`, this also works:
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer")
print(tok.tokenize("İstanbul'da güzel bir gün"))
```
### Train from scratch
```bash
# 1. Download and prepare corpus
python prepare_data.py --size medium
# 2. Train tokenizer
python train_tokenizer.py --data-dir data/
# 3. Optional: push tokenizer files to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
  --repo-id fzengin18/multrenizer \
  --hf-token "$HF_TOKEN"
```
### Run benchmarks
```bash
python benchmark.py --tr-lines 5000 --en-lines 5000
```
## Architecture
### Pipeline
```text
Raw text
-> Turkish I/i normalizer (Replace: İ->i, I->ı, i̇->i)
-> Quote canonicalization (’ ‘ ʼ ＇ -> ')
-> NFKC normalization
-> Lowercase
-> Strip whitespace
-> Pre-tokenizer (whitespace + apostrophe + punctuation split)
-> Unigram model (~26K target vocab)
-> Post-processor (<s> ... </s>)
```
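The stages before the Unigram model can be approximated with the standard library alone. This is a rough sketch of the pipeline above, not the shipped `tokenizers` configuration: the regex pre-tokenizer and the function names are assumptions, and the Unigram step that would split each word into subwords is left out.

```python
import re
import unicodedata
from typing import List

# Quote variants canonicalized to ASCII ': ’ ‘ ʼ ＇
QUOTE_VARIANTS = "\u2019\u2018\u02bc\uff07"


def normalize(text: str) -> str:
    """Turkish I/i fix, quote canonicalization, NFKC, lowercase, strip."""
    text = text.replace("\u0130", "i")   # İ -> i
    text = text.replace("I", "\u0131")   # I -> ı
    text = text.replace("i\u0307", "i")  # i + combining dot above -> i
    for quote in QUOTE_VARIANTS:
        text = text.replace(quote, "'")
    text = unicodedata.normalize("NFKC", text)
    return text.lower().strip()


def pre_tokenize(text: str) -> List[str]:
    """Split on whitespace; keep apostrophes and punctuation as own pieces."""
    return re.findall(r"\w+|[^\w\s]", text)


def pipeline(text: str) -> List[str]:
    """Normalize, pre-tokenize, and wrap in <s> ... </s>."""
    return ["<s>", *pre_tokenize(normalize(text)), "</s>"]


print(pipeline("  İstanbul’da güzel bir gün  "))
print(pre_tokenize(normalize("can't")))
```

Applying the `I -> ı` replacement before lowercasing matters: once `lower()` has run, an ASCII `I` has already become dotted `i` and the Turkish form is unrecoverable.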
### Data Mix
The released artifact is trained with the default file-based interleave in `train_tokenizer.py`, which approximates:
| Stream | Share | Purpose |
|---|---|---|
| Turkish | ~60% | Core Turkish morphology |
| English | ~30% | English coverage |
| Code-switching | ~10% | TR-EN boundary handling |
Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.
Exact source configs used during corpus preparation:
- `wikimedia/wikipedia` with `20231101.tr`
- `wikimedia/wikipedia` with `20231101.en`
- `Helsinki-NLP/opus-100` with `en-tr`
The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry.
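One simple way to produce a roughly 60/30/10 mix is a block-wise interleave of the three streams. The sketch below is hypothetical and only illustrates the ratio; the actual interleave logic lives in `train_tokenizer.py` and may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def interleave(
    tr: Iterable[str],
    en: Iterable[str],
    cs: Iterable[str],
    pattern: Tuple[int, int, int] = (6, 3, 1),
) -> Iterator[str]:
    """Yield blocks of 6 Turkish, 3 English, and 1 code-switching line."""
    streams = [iter(tr), iter(en), iter(cs)]
    while True:
        block = []
        for stream, share in zip(streams, pattern):
            chunk = list(islice(stream, share))
            if len(chunk) < share:
                return  # a stream ran dry: stop so the ratio stays intact
            block.extend(chunk)
        yield from block


tr_lines = [f"tr{i}" for i in range(12)]
en_lines = [f"en{i}" for i in range(6)]
cs_lines = [f"cs{i}" for i in range(2)]
mixed = list(interleave(tr_lines, en_lines, cs_lines))
print(len(mixed), mixed[:10])
```

Stopping at the first exhausted stream keeps the proportions exact at the cost of discarding leftover lines from the larger streams.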
### Vocabulary Budget
Multrenizer is designed around a `26,000` target vocabulary, with a fixed budget reserved for always-preserved tokens:
- `32` named special tokens
- `512` reserved tokens
- `292` utility tokens
- up to `25,164` learned subword tokens
Current shipped artifact: `25,917` total tokens.
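The budget arithmetic is easy to check. The last figure below is derived rather than stated in this README: it assumes every fixed token (named, reserved, utility) is present in the shipped artifact.

```python
# Budget figures as listed above.
BUDGET = {
    "named_special": 32,
    "reserved": 512,
    "utility": 292,
    "learned_max": 25_164,
}
TARGET_VOCAB = 26_000
SHIPPED_VOCAB = 25_917

total_budget = sum(BUDGET.values())
fixed_tokens = total_budget - BUDGET["learned_max"]

# Assumption: all fixed tokens are in the shipped file, so the rest is learned.
learned_in_shipped = SHIPPED_VOCAB - fixed_tokens

print(f"budget total:       {total_budget:,}")
print(f"fixed tokens:       {fixed_tokens:,}")
print(f"learned (derived):  {learned_in_shipped:,}")
print(f"unused target room: {TARGET_VOCAB - SHIPPED_VOCAB:,}")
```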
### Special Tokens
| Category | IDs | Tokens | Purpose |
|---|---|---|---|
| **Core** | 0-3 | `<unk>` `<s>` `</s>` `<pad>` | Basic tokenizer operation |
| **Chat** | 4-8 | `<\|system\|>` `<\|user\|>` `<\|assistant\|>` `<\|end\|>` `<\|sep\|>` | Instruction tuning and chat models |
| **Reasoning** | 9-12 | `<think>` `</think>` `<\|step\|>` `<\|reflection\|>` | Reasoning traces and self-check markers |
| **Tool Use** | 13-16 | `<tool_call>` `</tool_call>` `<tool_response>` `</tool_response>` | Tool and function calling |
| **Code/FIM** | 17-20 | `<\|code\|>` `<\|fim_prefix\|>` `<\|fim_middle\|>` `<\|fim_suffix\|>` | Code and fill-in-middle workflows |
| **Bilingual** | 21-22 | `<\|tr\|>` `<\|en\|>` | Language tags |
| **RAG** | 23-24 | `<\|context\|>` `<\|/context\|>` | Retrieval boundaries |
| **Multi-modal** | 25-28 | `<\|image\|>` `<\|audio\|>` `<\|video\|>` `<\|file\|>` | Placeholder tokens |
| **Structured** | 29-31 | `<\|json\|>` `<\|table\|>` `<\|cite\|>` | Structured output markers |
| **Reserved** | 32-543 | `<\|reserved_0\|>` ... `<\|reserved_511\|>` | Future growth without retraining |
| **Utility** | 544+ | Punctuation, emoji, math, currency, typography | Critical text symbols kept intact |
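The ID ranges in the table can be checked for gaps with a few lines of Python. This sketch only encodes the ranges listed above; the assumption that `<|reserved_i|>` sits at ID `32 + i` follows from the table's ordering and is not verified against `tokenizer.json` here.

```python
# Inclusive ID ranges copied from the special-token table.
SPECIAL_RANGES = {
    "Core": (0, 3),
    "Chat": (4, 8),
    "Reasoning": (9, 12),
    "Tool Use": (13, 16),
    "Code/FIM": (17, 20),
    "Bilingual": (21, 22),
    "RAG": (23, 24),
    "Multi-modal": (25, 28),
    "Structured": (29, 31),
    "Reserved": (32, 543),
}


def first_free_id(ranges: dict) -> int:
    """Verify the ranges are contiguous from 0 and return the next free ID."""
    expected_start = 0
    for name, (start, end) in ranges.items():
        if start != expected_start:
            raise ValueError(f"gap or overlap before {name}")
        expected_start = end + 1
    return expected_start


# Assumption: reserved tokens are numbered in ID order.
reserved_ids = {f"<|reserved_{i}|>": 32 + i for i in range(512)}

named_count = sum(
    end - start + 1
    for name, (start, end) in SPECIAL_RANGES.items()
    if name != "Reserved"
)
print("named special tokens:", named_count)
print("utility tokens start at:", first_free_id(SPECIAL_RANGES))
```

With the shipped tokenizer loaded, `tok.token_to_id("<|reserved_0|>")` would be one way to confirm the layout.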
### Utility Tokens
| Category | Count | Examples |
|---|---:|---|
| Punctuation | 31 | `. , ! ? ; : - ( ) [ ] { } / \ " ' ...` |
| Currency & Business | 15 | `₺ $ € £ ¥ % @ # &` |
| Math & Science | 25 | `± × ÷ ≠ ≤ ≥ ∞ √ π α β γ` |
| Arrows & Symbols | 15 | `→ ← ↑ ↓ • ★ ☆ ✓ ✗ © ® ™` |
| Typography | 10 | `« » “ ” ‘ ’ ‹ › „ ‚` |
| Emoji (faces) | 70 | `😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖` |
| Emoji (hands) | 28 | `👋 👍 👎 👏 🙏 💪 ✊ ✌️` |
| Emoji (hearts) | 18 | `❤️ 💛 💚 💙 💜 🖤 💔` |
| Emoji (symbols) | 36 | `🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀` |
| Emoji (objects) | 36 | `💻 📱 🎯 🏆 📊 ☕ 🔗 💰` |
| Emoji (flags) | 8 | `🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇪🇸 🇮🇹 🇯🇵` |
## Project Structure
```text
multrenizer/
├── multrenizer-tokenizer/      # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py             # Corpus download and preparation
├── train_tokenizer.py          # Tokenizer training script
├── benchmark.py                # Benchmark against 5 reference tokenizers
├── benchmark_results.json      # Full benchmark output
├── tests/                      # Regression tests for tokenizer behavior
├── requirements.txt
└── pyproject.toml
```
## References
- [Tokens with Meaning: A Hybrid Tokenization Approach for Turkish](https://arxiv.org/html/2508.14292v2)
- [Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark](https://arxiv.org/html/2502.07057v1)
- [Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE](https://arxiv.org/abs/2508.08424)
- [Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration](https://blog.squeezebits.com/vocabulary-trimming-methods)
## License
Apache 2.0