---
language:
- bg
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- is
- it
- lb
- lt
- lv
- mk
- mt
- nl
- "no"
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- tr
- uk
tags:
- tokenizer
- sentencepiece
- bpe
- multilingual
- european-languages
license: apache-2.0
---

# OpenEuroLLM Tokenizer (256k)

A **262,144-token SentencePiece BPE tokenizer** designed for efficient tokenization across all EU official languages and additional European languages. Trained on 173 GB of curated multilingual text from the OpenEuroLLM data catalogue on LUMI HPC.

## Key Results

- **Lowest fertility (tokens per word) on 26 out of 38 European languages**, outperforming Llama 3.2, Gemma 3, GPT-OSS, EuroLLM, Qwen 2.5, DeepSeek V3, and Mistral v0.3
- **Lowest average fertility overall: 2.12** across 38 languages
- Particularly strong on lower-resource EU languages: Lithuanian (31% fewer tokens than Llama), Hungarian (22%), Icelandic (26%), Maltese (25%)
- English fertility (1.58) is within 5% of the best (GPT-OSS at 1.51), so the broad multilingual coverage costs almost nothing on English

## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")

text = "Hello world! Bonjour le monde. Hej världen!"
ids = tok(text)["input_ids"]
decoded = tok.decode(ids, skip_special_tokens=True)

print(f"Tokens: {len(ids)}")
print(f"Decoded: {decoded}")
```

### Batch encoding

```python
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le rapide renard brun saute par-dessus le chien paresseux.",
]
batch = tok(texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (3, max_len)
```

### Special Tokens

| Token | ID | Purpose |
|-------|------|---------|
| `<unk>` | 0 | Unknown |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | End of sequence |
| `<start_of_turn>` | 3 | Chat turn start |
| `<end_of_turn>` | 4 | Chat turn end |
| `<start_of_image>` | 5 | Image start |
| `<end_of_image>` | 6 | Image end |
| `<image_soft_token>` | 7 | Image placeholder |
| `<fim_prefix>` | 8 | Fill-in-middle prefix |
| `<fim_middle>` | 9 | Fill-in-middle middle |
| `<fim_suffix>` | 10 | Fill-in-middle suffix |
| `<tool_call>` | 11 | Tool call start |
| `</tool_call>` | 12 | Tool call end |
| `<unused_0>`–`<unused_99>` | 13–112 | Reserved for future use |
| `<pad>` | 262,144 | Padding |
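
The turn tokens suggest a chat format, but this card does not ship an official chat template. A minimal sketch, assuming a Gemma-style turn layout (the layout itself is an assumption, not documented here):

```python
# Hedged sketch: built from the special tokens listed above; the exact turn
# format expected by downstream OpenEuroLLM models is not documented here.
prompt = (
    "<bos><start_of_turn>user\n"
    "Translate 'good morning' into Finnish.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
ids = tok(prompt, add_special_tokens=False)["input_ids"]
# Registered special tokens should come back as single pieces:
print(tok.convert_ids_to_tokens(ids[:2]))  # expected: ['<bos>', '<start_of_turn>']
```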

## Training Details

| Parameter | Value |
|-----------|-------|
| **Algorithm** | BPE (SentencePiece) |
| **Vocabulary size** | 262,144 |
| **Training data** | 173 GB multilingual corpus |
| **Data mix** | 70% English, 10% code/math, 20% other languages (37 languages) |
| **Character coverage** | 0.9995 |
| **Normalization** | Identity (lossless) |
| **Byte fallback** | Enabled |
| **Digit splitting** | Enabled |
| **Max piece length** | 16 |
| **Trained on** | LUMI HPC (CSC, Finland) |
| **Training time** | ~9 hours (32 CPUs, 128 GB RAM) |
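
For reference, a minimal `sentencepiece` training call consistent with the parameters above; the actual command is not published here, so the input path is a placeholder:

```python
# Hedged sketch: flags mirror the table above; "corpus.txt" is a placeholder.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="openeurollm_256k",
    model_type="bpe",
    vocab_size=262144,
    character_coverage=0.9995,
    normalization_rule_name="identity",
    byte_fallback=True,
    split_digits=True,
    max_sentencepiece_length=16,
    num_threads=32,
)
```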

### Data Sources

The training corpus aggregates cleaned and deduplicated text from: C4, FineWeb-2, Nemotron-CC, MADLAD-400, HPLT, FinePDFs, German-Commons, StarCoder, Proof-Pile-2, Cosmopedia-v2, and FineMath.

### Languages (37 + English)

**EU Official (23):** bg, hr, cs, da, nl, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv

**Additional European (14):** sq, eu, bs, ca, gl, is, lb, mk, no, ru, sr, tr, uk, cy

## Fertility Evaluation

Average tokens per word across 38 European languages (lower is better), evaluated on 200 Wikipedia articles per language; a measurement sketch follows the table:

| Tokenizer | Vocab | Avg Fertility | Languages Won |
|-----------|-------|---------------|---------------|
| **Ours 262k** | **262k** | **2.12** | **26** |
| GPT-OSS 20B | 200k | 2.26 | 8 |
| EuroLLM 1.7B | 128k | 2.27 | 3 |
| Ours 128k | 131k | 2.31 | 0 |
| Gemma 3 4B | 262k | 2.35 | 0 |
| DeepSeek V3 | 129k | 2.52 | 0 |
| Llama 3.2 1B | 128k | 2.56 | 1 |
| Qwen 2.5 | 152k | 2.83 | 0 |
| Mistral v0.3 | 33k | 2.97 | 0 |
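
A minimal sketch of how a per-language fertility number can be reproduced; the exact evaluation corpus and word-splitting rule used above are assumptions (whitespace words over Wikipedia plain text is one common choice):

```python
# Hedged sketch: fertility = tokens per whitespace-separated word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")

def fertility(texts: list[str]) -> float:
    """Average tokens per word over a corpus, excluding special tokens."""
    n_tokens = sum(len(tok(t, add_special_tokens=False)["input_ids"]) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

articles = ["..."]  # placeholder: load e.g. 200 Wikipedia articles for one language
print(f"fertility = {fertility(articles):.2f}")
```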

## See Also

- [openeurollm/tokenizer-128k](https://huggingface.co/openeurollm/tokenizer-128k) — 128k vocab variant (half the vocabulary, ~9% more tokens per word)