OpenEuroLLM Tokenizer (128k)

A 131,072-token SentencePiece BPE tokenizer designed for efficient tokenization across all EU official languages and additional European languages. Trained on 173 GB of curated multilingual text from the OpenEuroLLM data catalogue on LUMI HPC.

This is the compact variant — half the vocabulary of the 262k version with only ~9% higher fertility. Ideal when embedding table size matters.
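To get a rough sense of what the embedding-table saving means in parameters, here is back-of-the-envelope arithmetic; the hidden size of 4096 is a hypothetical value chosen only for illustration, not a property of any particular model:

hidden_size = 4096  # hypothetical model width, for illustration only

emb_262k = 262_144 * hidden_size   # ~1.07B parameters in the embedding matrix
emb_128k = 131_072 * hidden_size   # ~0.54B parameters

print(f"262k: {emb_262k / 1e9:.2f}B, 128k: {emb_128k / 1e9:.2f}B")
print(f"Saved: {(emb_262k - emb_128k) / 1e6:.0f}M parameters per embedding matrix")

With untied input and output embeddings, the saving applies twice.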

Key Results

  • Ranks 4th overall across 9 tokenizers with average fertility of 2.31
  • Beats Gemma 3 (2.35), DeepSeek V3 (2.52), Llama 3.2 (2.56), Qwen 2.5 (2.83), and Mistral v0.3 (2.97)
  • Half the vocabulary of the 262k variant, with only ~9% more tokens per word
  • English fertility (1.66) remains competitive

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-128k")

text = "Hello world! Bonjour le monde. Hej världen!"
ids = tok(text)["input_ids"]                         # encode text to token IDs
decoded = tok.decode(ids, skip_special_tokens=True)  # round-trip back to text

print(f"Tokens: {len(ids)}")
print(f"Decoded: {decoded}")

Batch encoding

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le rapide renard brun saute par-dessus le chien paresseux.",
]
batch = tok(texts, padding=True, return_tensors="pt")  # pad to the longest sequence in the batch
print(batch["input_ids"].shape)  # (3, max_len)
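Padding makes every row of the batch the same length, so padded shapes overstate per-language token counts; to compare languages directly, encode without padding (a quick sketch reusing the texts above):

for t, seq in zip(texts, tok(texts)["input_ids"]):  # ragged lists, no padding
    print(f"{len(seq):3d} tokens  {t[:40]}")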

Special Tokens

Token                    ID        Purpose
<unk>                    0         Unknown
<bos>                    1         Beginning of sequence
<eos>                    2         End of sequence
<start_of_turn>          3         Chat turn start
<end_of_turn>            4         Chat turn end
<start_of_image>         5         Image start
<end_of_image>           6         Image end
<image_soft_token>       7         Image placeholder
<fim_prefix>             8         Fill-in-middle prefix
<fim_middle>             9         Fill-in-middle middle
<fim_suffix>             10        Fill-in-middle suffix
<tool_call>              11        Tool call start
</tool_call>             12        Tool call end
<unused_0>–<unused_99>   13–112    Reserved for future use
<pad>                    131,072   Padding

Training Details

Parameter            Value
Algorithm            BPE (SentencePiece)
Vocabulary size      131,072
Training data        173 GB multilingual corpus
Data mix             70% English, 10% code/math, 20% other languages (37 languages)
Character coverage   0.9995
Normalization        Identity (lossless)
Byte fallback        Enabled
Digit splitting      Enabled
Max piece length     16
Trained on           LUMI HPC (CSC, Finland)
Training time        ~3h 14m (32 CPUs, 128 GB RAM)
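The exact training invocation is not reproduced on this card; the sketch below only maps the table's settings onto SentencePiece's Python trainer, with the corpus path and model prefix as placeholders:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # placeholder for the 173 GB corpus
    model_prefix="openeurollm_128k",     # placeholder output name
    model_type="bpe",
    vocab_size=131072,
    character_coverage=0.9995,
    normalization_rule_name="identity",  # lossless: no NFKC normalization
    byte_fallback=True,                  # unseen characters decompose into byte tokens
    split_digits=True,                   # digits are always separate pieces
    max_sentencepiece_length=16,
)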

Data Sources

The training corpus aggregates cleaned/deduplicated text from: C4, FineWeb-2, Nemotron-CC, MADLAD-400, HPLT, FinePDFs, German-Commons, StarCoder, Proof-Pile-2, Cosmopedia-v2, and FineMath.

Languages (37 + English)

EU Official (23): bg, hr, cs, da, nl, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv

Additional European (14): sq, eu, bs, ca, gl, is, lb, mk, no, ru, sr, tr, uk, cy

Fertility Evaluation

Average tokens per word across 38 European languages (lower = better), evaluated on 200 Wikipedia articles per language:

Tokenizer       Vocab   Avg fertility   Languages won
Ours (262k)     262k    2.12            26
GPT-OSS 20B     200k    2.26            8
EuroLLM 1.7B    128k    2.27            3
Ours (128k)     131k    2.31            0
Gemma 3 4B      262k    2.35            0
DeepSeek V3     129k    2.52            0
Llama 3.2 1B    128k    2.56            1
Qwen 2.5        152k    2.83            0
Mistral v0.3    33k     2.97            0
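Fertility is tokens per word: total tokens produced divided by total words. The published numbers come from 200 Wikipedia articles per language; the sketch below shows the metric itself, assuming whitespace word splitting (the evaluation's exact word segmentation may differ):

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-separated word."""
    tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

sample = ["Der schnelle braune Fuchs springt über den faulen Hund."]
print(f"{fertility(tok, sample):.2f} tokens/word")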
