---
language:
- bg
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- is
- it
- lb
- lt
- lv
- mk
- mt
- nl
- "no"
- pl
- pt
- ro
- ru
- sk
- sl
- sq
- sr
- sv
- tr
- uk
tags:
- tokenizer
- sentencepiece
- bpe
- multilingual
- european-languages
license: apache-2.0
---

# OpenEuroLLM Tokenizer (256k)

A **262,144-token SentencePiece BPE tokenizer** designed for efficient tokenization across all EU official languages and additional European languages. Trained on 173 GB of curated multilingual text from the OpenEuroLLM data catalogue on LUMI HPC.

## Key Results

- **Lowest fertility (tokens per word) on 26 of 38 European languages**, outperforming Llama 3.2, Gemma 3, GPT-OSS, EuroLLM, Qwen 2.5, DeepSeek V3, and Mistral v0.3
- **Lowest average fertility overall: 2.12** across the 38 languages
- Particularly strong on lower-resource EU languages: Lithuanian (31% fewer tokens than Llama), Hungarian (22%), Icelandic (26%), Maltese (25%)
- English fertility (1.58) is within 5% of the best result (GPT-OSS, 1.51), so there is no trade-off on English

## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")

text = "Hello world! Bonjour le monde. Hej världen!"
ids = tok(text)["input_ids"]
decoded = tok.decode(ids, skip_special_tokens=True)

print(f"Tokens: {len(ids)}")
print(f"Decoded: {decoded}")
```

### Batch encoding

```python
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le rapide renard brun saute par-dessus le chien paresseux.",
]
batch = tok(texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (3, max_len)
```

### Special Tokens

| Token | ID | Purpose |
|-------|------|---------|
| `` | 0 | Unknown |
| `` | 1 | Beginning of sequence |
| `` | 2 | End of sequence |
| `` | 3 | Chat turn start |
| `` | 4 | Chat turn end |
| `` | 5 | Image start |
| `` | 6 | Image end |
| `` | 7 | Image placeholder |
| `` | 8 | Fill-in-middle prefix |
| `` | 9 | Fill-in-middle middle |
| `` | 10 | Fill-in-middle suffix |
| `` | 11 | Tool call start |
| `` | 12 | Tool call end |
| ``–`` | 13–112 | Reserved for future use |
| `` | 262,144 | Padding |

## Training Details

| Parameter | Value |
|-----------|-------|
| **Algorithm** | BPE (SentencePiece) |
| **Vocabulary size** | 262,144 |
| **Training data** | 173 GB multilingual corpus |
| **Data mix** | 70% English, 10% code/math, 20% other languages (37 languages) |
| **Character coverage** | 0.9995 |
| **Normalization** | Identity (lossless) |
| **Byte fallback** | Enabled |
| **Digit splitting** | Enabled |
| **Max piece length** | 16 |
| **Trained on** | LUMI HPC (CSC, Finland) |
| **Training time** | ~9 hours (32 CPUs, 128 GB RAM) |

### Data Sources

The training corpus aggregates cleaned and deduplicated text from: C4, FineWeb-2, Nemotron-CC, MADLAD-400, HPLT, FinePDFs, German-Commons, StarCoder, Proof-Pile-2, Cosmopedia-v2, and FineMath.
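The training configuration above maps roughly onto SentencePiece's Python training API. The call below is a sketch, not the actual training script: the corpus path and output prefix are hypothetical placeholders, and corpus sharding and special-token setup are omitted.

```python
import sentencepiece as spm

# Sketch of the training setup from the table above (paths are hypothetical).
spm.SentencePieceTrainer.train(
    input="oellm_corpus.txt",            # hypothetical path to the 173 GB corpus
    model_prefix="openeurollm_256k",     # hypothetical output prefix
    model_type="bpe",                    # BPE (SentencePiece)
    vocab_size=262144,
    character_coverage=0.9995,
    normalization_rule_name="identity",  # identity (lossless) normalization
    byte_fallback=True,                  # fall back to bytes for unseen characters
    split_digits=True,                   # split numbers into single digits
    max_sentencepiece_length=16,         # max piece length
    num_threads=32,                      # trained with 32 CPUs on LUMI
)
```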
### Languages (37 + English)

**EU Official (23):** bg, hr, cs, da, nl, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv

**Additional European (14):** sq, eu, bs, ca, gl, is, lb, mk, no, ru, sr, tr, uk, cy

## Fertility Evaluation

Average tokens per word across 38 European languages (lower = better), evaluated on 200 Wikipedia articles per language:

| Tokenizer | Vocab | Avg Fertility | Languages Won |
|-----------|-------|---------------|---------------|
| **Ours 262k** | **262k** | **2.12** | **26** |
| GPT-OSS 20B | 200k | 2.26 | 8 |
| EuroLLM 1.7B | 128k | 2.27 | 3 |
| Ours 128k | 131k | 2.31 | 0 |
| Gemma 3 4B | 262k | 2.35 | 0 |
| DeepSeek V3 | 129k | 2.52 | 0 |
| Llama 3.2 1B | 128k | 2.56 | 1 |
| Qwen 2.5 | 152k | 2.83 | 0 |
| Mistral v0.3 | 33k | 2.97 | 0 |

## See Also

- [openeurollm/tokenizer-128k](https://huggingface.co/openeurollm/tokenizer-128k) — 128k vocab variant (half the vocabulary, ~9% more tokens per word)
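## Measuring Fertility

Fertility as reported above is tokens per word (lower is better). The snippet below is a minimal sketch of that metric, assuming whitespace-separated words; it reuses the example sentences from the batch-encoding section as placeholders rather than the Wikipedia evaluation set, so the number it prints will not match the table.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k")

# Placeholder texts; the reported results use 200 Wikipedia articles per language.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
]

# fertility = total tokens / total whitespace-separated words
total_tokens = sum(len(tok(t, add_special_tokens=False)["input_ids"]) for t in texts)
total_words = sum(len(t.split()) for t in texts)
print(f"Fertility: {total_tokens / total_words:.2f}")
```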