OpenEuroLLM Tokenizer v2 (128k)

128k-vocab variant of openeurollm/tokenizer-256k-v2. Same training corpus, same special tokens, smaller vocab for compute-constrained deployments.

Highlights vs SOTA — multi-domain eval (128k class)

Evaluation is a held-out 5-domain suite (8,600 samples total). Each column is mean tokens-per-whitespace-word (lower = better).
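
The metric can be reproduced in a few lines of Python. The sketch below assumes the corpus-level reading of "mean" (total tokens divided by total whitespace-separated words); the card does not say whether the evaluation instead averages per-sample ratios. The `fertility` helper name is made up for illustration.

```python
from typing import Callable, Iterable, Sequence


def fertility(tokenize: Callable[[str], Sequence], samples: Iterable[str]) -> float:
    """Tokens per whitespace-delimited word, pooled over all samples.

    Assumption: a corpus-level ratio. Averaging per-sample ratios would be
    a different (also plausible) reading of the metric.
    """
    total_tokens = 0
    total_words = 0
    for text in samples:
        total_tokens += len(tokenize(text))
        # str.split() with no argument splits on any run of whitespace.
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("samples contain no whitespace-delimited words")
    return total_tokens / total_words
```

With a loaded Hugging Face tokenizer `tok`, passing `tok.tokenize` as the `tokenize` argument should give a comparable number, provided special tokens are not added.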

Tokenizer	Vocab	Overall	FLORES-200 (36 langs parallel)	Code (Python)	Math (LaTeX+GSM8K)	Chat (ChatML)	PDFs (5 langs)
OpenEuroLLM v2 128k (this model)	131,072	2.09 🥇	2.00 🥇	3.32	1.92	1.52 🥇	2.43
Mistral Nemo	131,072	2.23	2.20	2.84	1.92	1.62	2.26
EuroLLM 9B	128,000	2.30	2.21	3.79	2.02	1.57	2.48
DeepSeek V3	128,000	2.47	2.51	2.83	1.65 🥇	1.65	2.13 🥇
OpenEuroLLM v1 128k (predecessor)	131,072	2.62	2.62	3.42	1.98	1.76	2.40
Llama 3.2 1B	128,256	2.68	2.78	2.60 🥇	1.65	1.65	2.18

For the full SOTA comparison incl. 256k-class tokenizers, see tokenizer-256k-v2.

v2-128k vs the field:

#1 overall (2.09) in its class — beats Mistral Nemo, EuroLLM, DeepSeek, Llama 3.2.
#1 on multilingual prose (FLORES 2.00 vs Mistral 2.20, EuroLLM 2.21).
#1 on chat (1.52, ChatML tokens working).
Loses on Python code (3.32 vs Llama 2.60); Llama's tiktoken-based BPE is more code-aggressive.

v2 vs v1: per-language deltas on FLORES-200

FLORES-200 has parallel sentences across all languages (semantically equivalent translations), so fertility differences here are pure tokenizer effect (no content drift).

Language	v1 128k	v2 128k	Δ
English (en)	1.29	1.23	−0.06
Albanian (sq)	2.44	1.76	−0.68
Basque (eu)	2.28	2.12	−0.17
Bosnian (bs)	1.84	1.78	−0.06
Bulgarian (bg)	1.95	2.13	+0.18
Catalan (ca)	1.77	1.70	−0.07
Croatian (hr)	1.91	1.82	−0.09
Czech (cs)	1.72	2.04	+0.31
Danish (da)	1.76	1.69	−0.07
Dutch (nl)	1.77	1.68	−0.09
Estonian (et)	2.41	2.30	−0.11
Finnish (fi)	2.71	2.59	−0.13
French (fr)	1.74	1.67	−0.07
Galician (gl)	1.64	1.58	−0.06
Georgian (ka)	22.93	3.30	−19.63
German (de)	1.61	1.86	+0.25
Greek (el)	2.63	2.44	−0.20
Hungarian (hu)	2.44	2.33	−0.12
Icelandic (is)	2.21	2.05	−0.16
Irish (ga)	1.91	1.79	−0.12
Italian (it)	1.45	1.66	+0.21
Latvian (lv)	3.18	2.20	−0.98
Lithuanian (lt)	2.30	2.27	−0.03
Macedonian (mk)	1.99	2.13	+0.13
Maltese (mt)	2.59	2.47	−0.12
Norwegian (no)	1.69	1.65	−0.04
Polish (pl)	1.95	2.16	+0.21
Portuguese (pt)	1.66	1.60	−0.06
Romanian (ro)	1.92	1.75	−0.17
Serbian (sr)	2.20	2.23	+0.02
Slovak (sk)	2.04	2.12	+0.08
Slovene (sl)	1.97	1.93	−0.04
Spanish (es)	1.60	1.54	−0.05
Swedish (sv)	1.85	1.81	−0.04
Turkish (tr)	2.40	2.16	−0.24
Ukrainian (uk)	2.42	2.46	+0.04
Average (36 catalogue langs)	2.62	2.00	−0.62

v2-128k improves on 27/36 languages. Biggest wins: Georgian −19.63, Latvian −0.98, Albanian −0.68. Georgian alone accounts for most of the −0.62 average change. The nine regressions (bg/cs/de/it/mk/pl/sr/sk/uk) are small (+0.02 to +0.31). Note: lb/ru/cy aren't tested; they're not in the catalogue and were dropped from v2's training scope.

Language coverage

Same as 256k-v2: 36 catalogue languages, Georgian (ka) included, lb/ru/cy dropped.

Training details

Algorithm: SentencePiece BPE
Vocab size: 131,072 (2^17)
Normalization: identity (lossless), byte fallback enabled
Corpus: identical to 256k-v2 (500 GB, catalogue-driven multilingual mix)
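
As a rough illustration, the settings listed above map onto SentencePiece trainer options as sketched below. This is not the project's actual training script: the `build_trainer_kwargs` helper, the file paths, and the symbol list are placeholders, the core token IDs are taken from this card's special-token table, and every option the card does not mention is left at its SentencePiece default, which may differ from the real configuration.

```python
from typing import Any, Dict, Iterable


def build_trainer_kwargs(
    corpus_path: str,
    model_prefix: str,
    user_defined_symbols: Iterable[str] = (),
) -> Dict[str, Any]:
    """Collect the trainer options stated in this card (hypothetical helper)."""
    return {
        "input": corpus_path,                     # placeholder path
        "model_prefix": model_prefix,             # placeholder output prefix
        "model_type": "bpe",                      # Algorithm: SentencePiece BPE
        "vocab_size": 2 ** 17,                    # 131,072
        "normalization_rule_name": "identity",    # lossless, no Unicode folding
        "byte_fallback": True,                    # unseen characters fall back to bytes
        # Core special tokens at fixed in-vocab IDs, per the card's table.
        "unk_id": 0, "unk_piece": "<unk>",
        "bos_id": 1, "bos_piece": "<bos>",
        "eos_id": 2, "eos_piece": "<eos>",
        "pad_id": 3, "pad_piece": "<pad>",
        # The 204 reserved symbols would go here; the full list is not
        # reproduced in this card, so the default is empty.
        "user_defined_symbols": list(user_defined_symbols),
    }


# Training itself would then look like this (requires the sentencepiece package):
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(**build_trainer_kwargs("corpus.txt", "tokenizer"))
```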

Special tokens

Core (locked at fixed IDs, in-vocab):

Token	ID
`<unk>`	0
`<bos>`	1
`<eos>`	2
`<pad>`	3

Fixes v1's pad-out-of-vocab bug (v1 had <pad> at vocab_size+0 = 131072).
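
A quick sanity check for this property might look like the sketch below. It takes any token-to-ID lookup so it can be run against either tokenizer; the `check_core_tokens` name is illustrative and not part of the released files.

```python
from typing import Callable, Dict

# Core token IDs as listed in the table above.
EXPECTED_CORE_IDS: Dict[str, int] = {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}


def check_core_tokens(token_to_id: Callable[[str], int], vocab_size: int) -> Dict[str, str]:
    """Return a dict of problems per token; an empty dict means all checks pass."""
    problems: Dict[str, str] = {}
    for token, expected in EXPECTED_CORE_IDS.items():
        actual = token_to_id(token)
        if not 0 <= actual < vocab_size:
            # This is the v1 failure mode: an ID at or beyond vocab_size.
            problems[token] = f"id {actual} is outside the vocab (size {vocab_size})"
        elif actual != expected:
            problems[token] = f"expected id {expected}, got {actual}"
    return problems
```

With a loaded Hugging Face tokenizer `tok`, a plausible call is `check_core_tokens(tok.convert_tokens_to_ids, tok.vocab_size)`. Whether `vocab_size` is the right bound depends on how the tokenizer was exported, so treat that as an assumption.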

204 user-defined symbols reserved — same set as 256k-v2:

Category	Count	Highlights	v1?
Whitespace family	45	2..32-space indents, 1..8 tabs, multi-newlines	❌ new
StarCoder code markers	16	`<filename>`, `<reponame>`, `<file_sep>`, `<jupyter_>`, `<commit_>`	❌ new
FIM (code completion)	4	`<fim_prefix/middle/suffix/pad>`	3 (no `<fim_pad>`)
ChatML chat	2	`<|im_start|>`, `<|im_end|>`
Gemma chat	2	`<start_of_turn>`, `<end_of_turn>`	✅
Tool use	2	`<tool_call>`, `</tool_call>`	✅
Reasoning	2	`<think>`, `</think>` (DeepSeek-R1 / Qwen3)	❌ new
Multimodal	3	`<start_of_image>`, `<end_of_image>`, `<image_soft_token>`	✅
Reserved unused	128	`<unused_0>` … `<unused_127>`	100 in v1

Why the whitespace tokens matter for code

In a Python file, an 8-space indent costs 1 token in v2 (the 8-space run is one of the reserved whitespace tokens) but 4 to 8 tokens in v1 and most other tokenizers. For a code-heavy training run this is a meaningful efficiency gain; see the 256k-v2 card for the example and full breakdown.
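
One rough way to measure this is to count how many extra tokens an indent adds to a fixed line of code. The sketch below does that for any tokenize function; `indent_cost` and the sample line `return x` are illustrative choices, and the subtraction is only an approximation because a tokenizer may merge the last space with the following word.

```python
from typing import Callable, Sequence


def indent_cost(tokenize: Callable[[str], Sequence], width: int = 8) -> int:
    """Extra tokens that `width` leading spaces add to a short line of code."""
    body = "return x"  # arbitrary sample line
    indented = " " * width + body
    # Difference in token count with and without the indent.
    return len(tokenize(indented)) - len(tokenize(body))
```

Passing `tok.tokenize` from two different tokenizers and comparing the results for widths 4, 8, and 16 gives a quick side-by-side view.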

Usage

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-128k-v2", use_fast=True)
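
Since normalization is identity, encoding and then decoding should give back the input unchanged. A minimal sketch of that check follows; the `roundtrips` helper is hypothetical, and it assumes the tokenizer follows the usual Hugging Face `encode` / `decode` interface.

```python
def roundtrips(tok, text: str) -> bool:
    """True if decode(encode(text)) reproduces text exactly."""
    # Skip <bos>/<eos> so only the text itself is compared.
    ids = tok.encode(text, add_special_tokens=False)
    # Disable post-decode whitespace cleanup, which would alter the output.
    decoded = tok.decode(ids, clean_up_tokenization_spaces=False)
    return decoded == text
```

For example, `roundtrips(tok, "def f():\n        return 1")` exercises the whitespace tokens. A `False` result does not necessarily indicate a bug in the vocabulary, since wrapper settings such as a dummy prefix space can also change the decoded string.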

Files

tokenizer.model — SentencePiece BPE model (2.2 MB)
tokenizer.vocab — vocabulary listing
special_tokens_map.json — HF special tokens map
tokenizer_config.json — HF tokenizer config

Citation

Built for the OpenEuroLLM project (Horizon Europe). Source repo: https://github.com/OpenEuroLLM/tokenizer.
