OpenEuroLLM Tokenizer v2 (128k)
128k-vocab variant of openeurollm/tokenizer-256k-v2. Same training corpus, same special tokens, smaller vocab for compute-constrained deployments.
Highlights vs SOTA: multi-domain eval (128k class)
Evaluation is a held-out 5-domain suite (8,600 samples total). Each column is mean tokens-per-whitespace-word (lower = better).
| Tokenizer | Vocab | Overall | FLORES-200 (36 langs parallel) | Code (Python) | Math (LaTeX+GSM8K) | Chat (ChatML) | PDFs (5 langs) |
|---|---|---|---|---|---|---|---|
| OpenEuroLLM v2 128k (this model) | 131,072 | 2.09 🥇 | 2.00 🥇 | 3.32 | 1.92 | 1.52 🥇 | 2.43 |
| Mistral Nemo | 131,072 | 2.23 | 2.20 | 2.84 | 1.92 | 1.62 | 2.26 |
| EuroLLM 9B | 128,000 | 2.30 | 2.21 | 3.79 | 2.02 | 1.57 | 2.48 |
| DeepSeek V3 | 128,000 | 2.47 | 2.51 | 2.83 | 1.65 🥇 | 1.65 | 2.13 🥇 |
| OpenEuroLLM v1 128k (predecessor) | 131,072 | 2.62 | 2.62 | 3.42 | 1.98 | 1.76 | 2.40 |
| Llama 3.2 1B | 128,256 | 2.68 | 2.78 | 2.60 🥇 | 1.65 | 1.65 | 2.18 |
For the full SOTA comparison incl. 256k-class tokenizers, see tokenizer-256k-v2.
v2-128k vs the field:
- #1 overall (2.09) in its class, ahead of Mistral Nemo, EuroLLM, DeepSeek, and Llama 3.2.
- #1 on multilingual prose (FLORES 2.00 vs Mistral 2.20, EuroLLM 2.21).
- #1 on chat (1.52), where the dedicated ChatML tokens pay off.
- Loses on Python code (3.32 vs Llama 2.60); Llama's tiktoken-based BPE is more code-aggressive.
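The metric in the table above (mean tokens per whitespace-delimited word) can be sketched as follows. This is a minimal illustration, not the project's evaluation script: it assumes the mean is taken over per-sample ratios (the card does not say whether averaging is per sample or corpus-level), and `toy_encode` is a hypothetical stand-in tokenizer so the snippet runs offline.

```python
from typing import Callable, Iterable, List


def fertility(samples: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Mean tokens per whitespace-delimited word, averaged over samples."""
    ratios = []
    for text in samples:
        n_words = len(text.split())
        if n_words == 0:
            continue  # skip empty samples to avoid dividing by zero
        ratios.append(len(encode(text)) / n_words)
    if not ratios:
        raise ValueError("no non-empty samples")
    return sum(ratios) / len(ratios)


def toy_encode(text: str) -> List[int]:
    """Hypothetical tokenizer: one token per started 4-character chunk of each word."""
    ids: List[int] = []
    for word in text.split():
        ids.extend(range((len(word) + 3) // 4))
    return ids


# "hello world" -> 4 tokens / 2 words; "tokenization" -> 3 tokens / 1 word
score = fertility(["hello world", "tokenization"], toy_encode)
```

To score a real tokenizer, one would pass something like `lambda s: tok.encode(s, add_special_tokens=False)` as `encode`, so that `<bos>`/`<eos>` do not inflate the count.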
v2 vs v1: per-language deltas on FLORES-200
FLORES-200 has parallel sentences across all languages (semantically equivalent translations), so fertility differences here are pure tokenizer effect (no content drift).
| Language | v1 128k | v2 128k | Δ |
|---|---|---|---|
| English (en) | 1.29 | 1.23 | −0.06 |
| Albanian (sq) | 2.44 | 1.76 | −0.68 |
| Basque (eu) | 2.28 | 2.12 | −0.17 |
| Bosnian (bs) | 1.84 | 1.78 | −0.06 |
| Bulgarian (bg) | 1.95 | 2.13 | +0.18 |
| Catalan (ca) | 1.77 | 1.70 | −0.07 |
| Croatian (hr) | 1.91 | 1.82 | −0.09 |
| Czech (cs) | 1.72 | 2.04 | +0.31 |
| Danish (da) | 1.76 | 1.69 | −0.07 |
| Dutch (nl) | 1.77 | 1.68 | −0.09 |
| Estonian (et) | 2.41 | 2.30 | −0.11 |
| Finnish (fi) | 2.71 | 2.59 | −0.13 |
| French (fr) | 1.74 | 1.67 | −0.07 |
| Galician (gl) | 1.64 | 1.58 | −0.06 |
| Georgian (ka) | 22.93 | 3.30 | −19.63 |
| German (de) | 1.61 | 1.86 | +0.25 |
| Greek (el) | 2.63 | 2.44 | −0.20 |
| Hungarian (hu) | 2.44 | 2.33 | −0.12 |
| Icelandic (is) | 2.21 | 2.05 | −0.16 |
| Irish (ga) | 1.91 | 1.79 | −0.12 |
| Italian (it) | 1.45 | 1.66 | +0.21 |
| Latvian (lv) | 3.18 | 2.20 | −0.98 |
| Lithuanian (lt) | 2.30 | 2.27 | −0.03 |
| Macedonian (mk) | 1.99 | 2.13 | +0.13 |
| Maltese (mt) | 2.59 | 2.47 | −0.12 |
| Norwegian (no) | 1.69 | 1.65 | −0.04 |
| Polish (pl) | 1.95 | 2.16 | +0.21 |
| Portuguese (pt) | 1.66 | 1.60 | −0.06 |
| Romanian (ro) | 1.92 | 1.75 | −0.17 |
| Serbian (sr) | 2.20 | 2.23 | +0.02 |
| Slovak (sk) | 2.04 | 2.12 | +0.08 |
| Slovene (sl) | 1.97 | 1.93 | −0.04 |
| Spanish (es) | 1.60 | 1.54 | −0.05 |
| Swedish (sv) | 1.85 | 1.81 | −0.04 |
| Turkish (tr) | 2.40 | 2.16 | −0.24 |
| Ukrainian (uk) | 2.42 | 2.46 | +0.04 |
| Average (36 catalogue langs) | 2.62 | 2.00 | −0.62 |
v2-128k improves on 27/36 languages. Biggest wins: Georgian −19.63, Latvian −0.98, Albanian −0.68. The nine regressions (bg/cs/de/it/mk/pl/sr/sk/uk) are small (+0.02 to +0.31). Note: lb/ru/cy aren't tested; they're not in the catalogue and were dropped from v2's training scope.
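As a small worked example of how the Δ column separates wins from regressions, the sketch below recomputes deltas for six rows copied from the table. The function name `split_by_direction` is illustrative only. Because the inputs here are already rounded to two decimals, a recomputed delta can differ from the published one by 0.01 (for example Czech: 2.04 − 1.72 = +0.32 versus +0.31 in the table, which was presumably computed from unrounded scores).

```python
# (v1, v2) mean tokens per word, copied from six rows of the FLORES-200 table.
FLORES = {
    "ka": (22.93, 3.30),
    "lv": (3.18, 2.20),
    "sq": (2.44, 1.76),
    "cs": (1.72, 2.04),
    "de": (1.61, 1.86),
    "sr": (2.20, 2.23),
}


def split_by_direction(scores):
    """Split languages into improved (delta < 0) and regressed (delta > 0)."""
    improved, regressed = {}, {}
    for lang, (v1, v2) in scores.items():
        delta = round(v2 - v1, 2)  # negative means fewer tokens per word in v2
        if delta < 0:
            improved[lang] = delta
        elif delta > 0:
            regressed[lang] = delta
    return improved, regressed


improved, regressed = split_by_direction(FLORES)
```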
Language coverage
Same as 256k-v2: 36 catalogue languages, Georgian (ka) included, lb/ru/cy dropped.
Training details
- Algorithm: SentencePiece BPE
- Vocab size: 131,072 (2^17)
- Normalization: identity (lossless), byte fallback enabled
- Corpus: identical to 256k-v2 (500 GB, catalogue-driven multi-lingual mix)
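For orientation, the settings listed above map onto SentencePiece trainer options roughly as sketched below. This is an assumption-laden illustration, not the project's actual training command: the file paths are hypothetical, and a real run would carry further options not stated here (special-token IDs, user-defined symbols, whitespace handling, sampling).

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return dict(
        input=corpus_path,                   # hypothetical path to the training text
        model_prefix=model_prefix,           # hypothetical output prefix
        model_type="bpe",                    # Algorithm: SentencePiece BPE
        vocab_size=2 ** 17,                  # 131,072
        normalization_rule_name="identity",  # no Unicode normalization
        byte_fallback=True,                  # unseen characters decompose into byte pieces
    )


def train(corpus_path: str, model_prefix: str) -> None:
    # Imported lazily so the configuration can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
```

Byte fallback is what prevents uncovered scripts from collapsing to `<unk>`, at the cost of several byte tokens per character when a script is poorly covered by the learned vocabulary.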
Special tokens
Core (locked at fixed IDs, in-vocab):
| Token | ID |
|---|---|
| `<unk>` | 0 |
| `<bos>` | 1 |
| `<eos>` | 2 |
| `<pad>` | 3 |
This fixes v1's pad-out-of-vocab bug: v1 placed `<pad>` at vocab_size+0 = 131072, one past the last valid ID (131071).
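Why an out-of-vocab pad ID is a problem: an embedding table with `vocab_size` rows only accepts IDs in `0 .. vocab_size - 1`, so ID 131072 cannot be looked up unless the table is resized. A minimal sketch of that range check, using the IDs stated above (the helper name `out_of_vocab` is illustrative, not part of any library):

```python
VOCAB_SIZE = 131_072

V2_CORE = {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}
V1_PAD_ID = VOCAB_SIZE + 0  # 131072, as described for v1


def out_of_vocab(token_ids: dict, vocab_size: int) -> list:
    """Return tokens whose ID is not a valid row index for a vocab_size-row table."""
    return [tok for tok, idx in token_ids.items() if not 0 <= idx < vocab_size]


v2_bad = out_of_vocab(V2_CORE, VOCAB_SIZE)                # all four IDs are in range
v1_bad = out_of_vocab({"<pad>": V1_PAD_ID}, VOCAB_SIZE)   # pad falls outside the table
```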
204 user-defined symbols are reserved, the same set as in 256k-v2:
| Category | Count | Highlights | v1? |
|---|---|---|---|
| Whitespace family | 45 | 2..32-space indents, 1..8 tabs, multi-newlines | new |
| StarCoder code markers | 16 | `<filename>`, `<reponame>`, `<file_sep>`, `<jupyter_*>`, `<commit_*>` | new |
| FIM (code completion) | 4 | `<fim_prefix>`, `<fim_middle>`, `<fim_suffix>`, `<fim_pad>` | 3 (no `<fim_pad>`) |
| ChatML chat | 2 | `<\|im_start\|>`, `<\|im_end\|>` | |
| Gemma chat | 2 | `<start_of_turn>`, `<end_of_turn>` | ✓ |
| Tool use | 2 | `<tool_call>`, `</tool_call>` | ✓ |
| Reasoning | 2 | `<think>`, `</think>` (DeepSeek-R1 / Qwen3) | new |
| Multimodal | 3 | `<start_of_image>`, `<end_of_image>`, `<image_soft_token>` | ✓ |
| Reserved unused | 128 | `<unused_0>` … `<unused_127>` | 100 in v1 |
Why the whitespace tokens matter for code
In a Python file, an 8-space indent tokenizes as 1 token in v2 (the 8-space run is reserved as a single symbol) but as 4-8 tokens in v1 and most other tokenizers. For a code-heavy training run this is a meaningful efficiency gain; see the 256k-v2 card for the example and full breakdown.
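The 1-versus-4-to-8 token gap can be illustrated with a toy model of indentation coverage. This is a simplification under stated assumptions, not how SentencePiece actually segments text: it covers a run of spaces by greedy longest-match over a set of reserved run lengths, and assumes a single space is always available as a fallback. The sets `pairs_only` and `singles_only` are hypothetical vocabularies chosen to reproduce the two ends of the quoted range.

```python
def indent_tokens(indent_width: int, reserved_runs: set) -> int:
    """Count tokens needed to cover indent_width spaces, greedily taking the
    longest reserved run that still fits (falling back to one space)."""
    remaining, count = indent_width, 0
    while remaining > 0:
        step = max((r for r in reserved_runs if r <= remaining), default=1)
        remaining -= step
        count += 1
    return count


v2_like = set(range(2, 33))  # 2..32-space runs, as in the whitespace family
pairs_only = {2}             # hypothetical vocab with only a 2-space piece
singles_only = set()         # hypothetical vocab with no multi-space piece

tokens_v2 = indent_tokens(8, v2_like)
tokens_pairs = indent_tokens(8, pairs_only)
tokens_singles = indent_tokens(8, singles_only)
```

Under this toy model a deeply nested line saves several tokens on every line, which is where the gain for code-heavy corpora would come from.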
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-128k-v2", use_fast=True)
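Since normalization is identity with byte fallback, one would expect encode/decode to round-trip text exactly. The sketch below shows one way to check that. It is hedged on two points: the Hugging Face wrapper may still alter whitespace on decode depending on its configuration, and `ByteTokenizer` is a hypothetical stand-in with the same `encode`/`decode` shape so the check can be exercised offline. `check_hub_tokenizer` downloads the real tokenizer and is defined but not called here.

```python
def roundtrip_ok(tokenizer, text: str) -> bool:
    """True if decoding the encoded text reproduces it exactly."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids, skip_special_tokens=False) == text


class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte, trivially lossless."""

    def encode(self, text, add_special_tokens=False):
        return list(text.encode("utf-8"))

    def decode(self, ids, skip_special_tokens=False):
        return bytes(ids).decode("utf-8")


def check_hub_tokenizer(text: str) -> bool:
    """Run the same check against the published tokenizer (needs network access)."""
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-128k-v2", use_fast=True)
    return roundtrip_ok(tok, text)


offline_ok = roundtrip_ok(ByteTokenizer(), "def f(x):\n        return x  # ქართული")
```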
Files
- tokenizer.model: SentencePiece BPE model (2.2 MB)
- tokenizer.vocab: vocabulary listing
- special_tokens_map.json: HF special tokens map
- tokenizer_config.json: HF tokenizer config
Citation
Built for the OpenEuroLLM project (Horizon Europe). Source repo: https://github.com/OpenEuroLLM/tokenizer.