---
language:
- bg
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- is
- it
- lb
- lt
- lv
- mk
- mt
- nl
- "no"
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- tr
- uk
tags:
- tokenizer
- sentencepiece
- bpe
- multilingual
- european-languages
license: apache-2.0
---
# OpenEuroLLM Tokenizer (256k)
A **262,144-token SentencePiece BPE tokenizer** designed for efficient tokenization across all EU official languages and additional European languages. Trained on 173 GB of curated multilingual text from the OpenEuroLLM data catalogue on LUMI HPC.
## Key Results
- **Best fertility (tokens per word) on 26 out of 38 European languages**, outperforming Llama 3.2, Gemma 3, GPT-OSS, EuroLLM, Qwen 2.5, DeepSeek V3, and Mistral v0.3
- **Lowest average fertility overall: 2.12** across 38 languages
- Particularly strong on lower-resource EU languages: Lithuanian (31% fewer tokens than Llama 3.2), Hungarian (22%), Icelandic (26%), Maltese (25%)
- English fertility (1.58) within 5% of best (GPT-OSS, 1.51) — no trade-off
## Usage
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")
text = "Hello world! Bonjour le monde. Hej världen!"
ids = tok(text)["input_ids"]
decoded = tok.decode(ids, skip_special_tokens=True)
print(f"Tokens: {len(ids)}")
print(f"Decoded: {decoded}")
```
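### Inspecting subword pieces
To see how the tokenizer segments text in different languages, the pieces can be listed directly (continuing from the snippet above):
```python
# Compare segmentation across the languages in the example above.
for sentence in ["Hello world!", "Bonjour le monde.", "Hej världen!"]:
    pieces = tok.tokenize(sentence)
    print(f"{sentence!r} -> {pieces} ({len(pieces)} tokens)")
```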
### Batch encoding
```python
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le rapide renard brun saute par-dessus le chien paresseux.",
]
batch = tok(texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape) # (3, max_len)
```
### Special Tokens
| Token | ID | Purpose |
|-------|------|---------|
| `<unk>` | 0 | Unknown |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | End of sequence |
| `<start_of_turn>` | 3 | Chat turn start |
| `<end_of_turn>` | 4 | Chat turn end |
| `<start_of_image>` | 5 | Image start |
| `<end_of_image>` | 6 | Image end |
| `<image_soft_token>` | 7 | Image placeholder |
| `<fim_prefix>` | 8 | Fill-in-middle prefix |
| `<fim_middle>` | 9 | Fill-in-middle middle |
| `<fim_suffix>` | 10 | Fill-in-middle suffix |
| `<tool_call>` | 11 | Tool call start |
| `</tool_call>` | 12 | Tool call end |
| `<unused_0>`–`<unused_99>` | 13–112 | Reserved for future use |
| `<pad>` | 262,144 | Padding |
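No chat template ships with the tokenizer, so how these markers are arranged in a prompt is defined by the downstream model. As a rough illustration only (the role labels, newlines, and layout below are assumptions, not a documented format), the markers encode to single IDs and can be placed in plain strings:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")

# Each marker maps to a single ID from the table above.
print(tok.convert_tokens_to_ids(["<start_of_turn>", "<end_of_turn>", "<fim_prefix>"]))

# Hypothetical chat layout; role labels and newlines are illustrative assumptions.
chat = (
    "<start_of_turn>user\nWhat is the capital of Finland?<end_of_turn>"
    "<start_of_turn>assistant\n"
)
print(tok(chat, add_special_tokens=False)["input_ids"])
```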
## Training Details
| Parameter | Value |
|-----------|-------|
| **Algorithm** | BPE (SentencePiece) |
| **Vocabulary size** | 262,144 |
| **Training data** | 173 GB multilingual corpus |
| **Data mix** | 70% English, 10% code/math, 20% other languages (37 languages) |
| **Character coverage** | 0.9995 |
| **Normalization** | Identity (lossless) |
| **Byte fallback** | Enabled |
| **Digit splitting** | Enabled |
| **Max piece length** | 16 |
| **Trained on** | LUMI HPC (CSC, Finland) |
| **Training time** | ~9 hours (32 CPUs, 128 GB RAM) |
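The parameters above map directly onto the SentencePiece trainer. A minimal sketch is shown below; the input path, model prefix, and thread count are placeholder assumptions, and the actual LUMI pipeline is not reproduced here:
```python
import sentencepiece as spm

# Minimal sketch of the training configuration from the table above.
spm.SentencePieceTrainer.train(
    input="corpus_173gb.txt",            # hypothetical path to the mixed corpus
    model_prefix="openeurollm_256k",     # hypothetical output prefix
    model_type="bpe",
    vocab_size=262_144,
    character_coverage=0.9995,
    normalization_rule_name="identity",  # lossless: no Unicode normalization
    byte_fallback=True,                  # unknown characters fall back to raw bytes
    split_digits=True,                   # each digit becomes its own piece
    max_sentencepiece_length=16,
    num_threads=32,
    train_extremely_large_corpus=True,   # assumption: needed at this corpus size
)
```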
### Data Sources
The training corpus aggregates cleaned/deduplicated text from: C4, FineWeb-2, Nemotron-CC, MADLAD-400, HPLT, FinePDFs, German-Commons, StarCoder, Proof-Pile-2, Cosmopedia-v2, and FineMath.
### Languages (37 + English)
**EU Official (23):** bg, hr, cs, da, nl, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv
**Additional European (14):** sq, eu, bs, ca, gl, is, lb, mk, no, ru, sr, tr, uk, cy
## Fertility Evaluation
Average tokens per word across 38 European languages (lower = better), evaluated on 200 Wikipedia articles per language:
| Tokenizer | Vocab | Avg Fertility | Languages Won |
|-----------|-------|---------------|---------------|
| **Ours 262k** | **262k** | **2.12** | **26** |
| GPT-OSS 20B | 200k | 2.26 | 8 |
| EuroLLM 1.7B | 128k | 2.27 | 3 |
| Ours 128k | 131k | 2.31 | 0 |
| Gemma 3 4B | 262k | 2.35 | 0 |
| DeepSeek V3 | 129k | 2.52 | 0 |
| Llama 3.2 1B | 128k | 2.56 | 1 |
| Qwen 2.5 | 152k | 2.83 | 0 |
| Mistral v0.3 | 33k | 2.97 | 0 |
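A rough sketch of how such a fertility number could be reproduced for one language; the whitespace word split and the single-sentence sample are simplifying assumptions, whereas the evaluation above used 200 Wikipedia articles per language:
```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-separated word (lower = better)."""
    n_tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")
# Placeholder sample; replace with a proper per-language corpus for a real measurement.
sample = ["Vilnius yra Lietuvos sostinė ir didžiausias šalies miestas."]
print(f"Fertility: {fertility(tok, sample):.2f}")
```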
## See Also
- [openeurollm/tokenizer-128k](https://huggingface.co/openeurollm/tokenizer-128k) — 128k vocab variant (half the vocabulary, ~9% more tokens per word)