OpenEuroLLM Tokenizer v2 (128k)

128k-vocab variant of openeurollm/tokenizer-256k-v2. Same training corpus, same special tokens, smaller vocab for compute-constrained deployments.

Highlights vs SOTA: multi-domain eval (128k class)

Evaluation is a held-out 5-domain suite (8,600 samples total). Each column is mean tokens-per-whitespace-word (lower = better).
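The metric can be sketched as follows. This is a minimal illustration, assuming one tokens-per-word ratio per sample and a plain average over samples; the evaluation suite may instead pool counts over the whole corpus. The names `mean_fertility` and `char_tokenize` are placeholders for this example, not part of the project's evaluation code.

```python
from typing import Callable, Iterable, List


def mean_fertility(samples: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Mean tokens per whitespace-delimited word, averaged over samples.

    Assumption: each non-empty sample contributes one ratio and the ratios
    are averaged with equal weight.
    """
    ratios = []
    for text in samples:
        n_words = len(text.split())  # whitespace-delimited words
        if n_words == 0:
            continue  # skip empty samples rather than divide by zero
        ratios.append(len(tokenize(text)) / n_words)
    if not ratios:
        raise ValueError("no non-empty samples")
    return sum(ratios) / len(ratios)


# Toy tokenizer for demonstration only: one token per non-space character.
def char_tokenize(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


# "ab cd" -> 4 tokens / 2 words = 2.0; "abcd" -> 4 tokens / 1 word = 4.0
print(mean_fertility(["ab cd", "abcd"], char_tokenize))  # 3.0
```

With a Hugging Face tokenizer, its `tokenize` method can be passed as the `tokenize` argument.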

| Tokenizer | Vocab | Overall | FLORES-200 (36 langs parallel) | Code (Python) | Math (LaTeX+GSM8K) | Chat (ChatML) | PDFs (5 langs) |
|---|---|---|---|---|---|---|---|
| OpenEuroLLM v2 128k (this model) | 131,072 | 2.09 🥇 | 2.00 🥇 | 3.32 | 1.92 | 1.52 🥇 | 2.43 |
| Mistral Nemo | 131,072 | 2.23 | 2.20 | 2.84 | 1.92 | 1.62 | 2.26 |
| EuroLLM 9B | 128,000 | 2.30 | 2.21 | 3.79 | 2.02 | 1.57 | 2.48 |
| DeepSeek V3 | 128,000 | 2.47 | 2.51 | 2.83 | 1.65 🥇 | 1.65 | 2.13 🥇 |
| OpenEuroLLM v1 128k (predecessor) | 131,072 | 2.62 | 2.62 | 3.42 | 1.98 | 1.76 | 2.40 |
| Llama 3.2 1B | 128,256 | 2.68 | 2.78 | 2.60 🥇 | 1.65 | 1.65 | 2.18 |

For the full SOTA comparison incl. 256k-class tokenizers, see tokenizer-256k-v2.

v2-128k vs the field:

  • #1 overall (2.09) in its class: beats Mistral Nemo, EuroLLM, DeepSeek, Llama 3.2.
  • #1 on multilingual prose (FLORES 2.00 vs Mistral 2.20, EuroLLM 2.21).
  • #1 on chat (1.52), with the ChatML tokens working as reserved special tokens.
  • Loses on Python code (3.32 vs Llama 2.60); Llama's tiktoken-based BPE is more code-aggressive.

v2 vs v1: per-language deltas on FLORES-200

FLORES-200 has parallel sentences across all languages (semantically equivalent translations), so differences in fertility (tokens per word) here are purely a tokenizer effect, with no content drift.

| Language | v1 128k | v2 128k | Δ |
|---|---|---|---|
| English (en) | 1.29 | 1.23 | −0.06 |
| Albanian (sq) | 2.44 | 1.76 | −0.68 |
| Basque (eu) | 2.28 | 2.12 | −0.17 |
| Bosnian (bs) | 1.84 | 1.78 | −0.06 |
| Bulgarian (bg) | 1.95 | 2.13 | +0.18 |
| Catalan (ca) | 1.77 | 1.70 | −0.07 |
| Croatian (hr) | 1.91 | 1.82 | −0.09 |
| Czech (cs) | 1.72 | 2.04 | +0.31 |
| Danish (da) | 1.76 | 1.69 | −0.07 |
| Dutch (nl) | 1.77 | 1.68 | −0.09 |
| Estonian (et) | 2.41 | 2.30 | −0.11 |
| Finnish (fi) | 2.71 | 2.59 | −0.13 |
| French (fr) | 1.74 | 1.67 | −0.07 |
| Galician (gl) | 1.64 | 1.58 | −0.06 |
| Georgian (ka) | 22.93 | 3.30 | −19.63 |
| German (de) | 1.61 | 1.86 | +0.25 |
| Greek (el) | 2.63 | 2.44 | −0.20 |
| Hungarian (hu) | 2.44 | 2.33 | −0.12 |
| Icelandic (is) | 2.21 | 2.05 | −0.16 |
| Irish (ga) | 1.91 | 1.79 | −0.12 |
| Italian (it) | 1.45 | 1.66 | +0.21 |
| Latvian (lv) | 3.18 | 2.20 | −0.98 |
| Lithuanian (lt) | 2.30 | 2.27 | −0.03 |
| Macedonian (mk) | 1.99 | 2.13 | +0.13 |
| Maltese (mt) | 2.59 | 2.47 | −0.12 |
| Norwegian (no) | 1.69 | 1.65 | −0.04 |
| Polish (pl) | 1.95 | 2.16 | +0.21 |
| Portuguese (pt) | 1.66 | 1.60 | −0.06 |
| Romanian (ro) | 1.92 | 1.75 | −0.17 |
| Serbian (sr) | 2.20 | 2.23 | +0.02 |
| Slovak (sk) | 2.04 | 2.12 | +0.08 |
| Slovene (sl) | 1.97 | 1.93 | −0.04 |
| Spanish (es) | 1.60 | 1.54 | −0.05 |
| Swedish (sv) | 1.85 | 1.81 | −0.04 |
| Turkish (tr) | 2.40 | 2.16 | −0.24 |
| Ukrainian (uk) | 2.42 | 2.46 | +0.04 |
| Average (36 catalogue langs) | 2.62 | 2.00 | −0.62 |

v2-128k improves on 27/36 languages. Biggest wins: Georgian −19.63, Latvian −0.98, Albanian −0.68. The nine regressions (bg/cs/de/it/mk/pl/sr/sk/uk) are small (+0.02 to +0.31). Note: lb/ru/cy aren't tested because they're not in the catalogue and were dropped from v2's training scope.
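The per-language comparison can be reproduced with a few lines of arithmetic. The sketch below uses five rows copied from the table; `ROWS` and `deltas` are names chosen for this example. Deltas recomputed from the rounded columns can differ by 0.01 from the published Δ (Czech comes out as 0.32 rather than +0.31), presumably because the published column was computed before rounding.

```python
# (v1, v2) fertility for a few rows copied from the FLORES-200 table.
ROWS = {
    "ka": (22.93, 3.30),
    "lv": (3.18, 2.20),
    "sq": (2.44, 1.76),
    "cs": (1.72, 2.04),
    "de": (1.61, 1.86),
}


def deltas(rows):
    """Return {lang: v2 - v1}, rounded to two decimals; negative means v2 is better."""
    return {lang: round(v2 - v1, 2) for lang, (v1, v2) in rows.items()}


d = deltas(ROWS)
# Largest improvement first (most negative delta).
improved = sorted((lang for lang, x in d.items() if x < 0), key=d.get)
# Largest regression first (most positive delta).
regressed = sorted((lang for lang, x in d.items() if x > 0), key=d.get, reverse=True)

print(improved)   # ['ka', 'lv', 'sq']
print(regressed)  # ['cs', 'de']
```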

Language coverage

Same as 256k-v2: 36 catalogue languages, Georgian (ka) included, lb/ru/cy dropped.

Training details

  • Algorithm: SentencePiece BPE
  • Vocab size: 131,072 (2^17)
  • Normalization: identity (lossless), byte fallback enabled
  • Corpus: identical to 256k-v2 (500 GB, catalogue-driven multi-lingual mix)
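A training call consistent with the settings listed above might look like the sketch below. It is an assumption-laden illustration, not the project's actual configuration: only the algorithm, vocab size, normalization, byte fallback, and the core token IDs and names come from this card; the corpus path, model prefix, and symbol list are hypothetical, and every other SentencePiece option is left at its default.

```python
def spm_train_kwargs(corpus_path: str, model_prefix: str, user_symbols: list) -> dict:
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return dict(
        input=corpus_path,                    # hypothetical path to a text corpus
        model_prefix=model_prefix,            # hypothetical output prefix
        model_type="bpe",                     # SentencePiece BPE
        vocab_size=131_072,                   # 2**17
        normalization_rule_name="identity",   # lossless, no Unicode normalization
        byte_fallback=True,                   # unseen characters fall back to bytes
        unk_id=0, bos_id=1, eos_id=2, pad_id=3,
        unk_piece="<unk>", bos_piece="<bos>",
        eos_piece="<eos>", pad_piece="<pad>",
        user_defined_symbols=user_symbols,    # reserved symbols, e.g. ["<think>", "</think>"]
    )


def train(corpus_path: str, model_prefix: str, user_symbols: list) -> None:
    # Imported lazily so the kwargs can be inspected without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **spm_train_kwargs(corpus_path, model_prefix, user_symbols)
    )
```

Calling `train("corpus.txt", "tokenizer", ["<think>", "</think>"])` would write `tokenizer.model` and `tokenizer.vocab` next to the script, given a suitably large corpus.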

Special tokens

Core (locked at fixed IDs, in-vocab):

| Token | ID |
|---|---|
| <unk> | 0 |
| <bos> | 1 |
| <eos> | 2 |
| <pad> | 3 |

Fixes v1's pad-out-of-vocab bug (v1 had <pad> at vocab_size+0 = 131072).
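The core layout can be checked programmatically. The sketch below assumes only that the tokenizer object exposes a Hugging Face style `convert_tokens_to_ids` method; `check_core_tokens` and `V1LikeStub` are hypothetical names for this example. The stub mimics the v1 problem described above, and only its `<pad>` position comes from this card; its other IDs are assumed for illustration.

```python
CORE_IDS = {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}


def check_core_tokens(tok, vocab_size: int = 131_072) -> list:
    """Return a list of problems; an empty list means the core layout matches.

    `tok` is any object with a `convert_tokens_to_ids(token)` method.
    """
    problems = []
    for token, expected in CORE_IDS.items():
        actual = tok.convert_tokens_to_ids(token)
        if actual is None or actual >= vocab_size:
            problems.append(f"{token}: id {actual} is outside the {vocab_size}-entry vocab")
        elif actual != expected:
            problems.append(f"{token}: expected id {expected}, got {actual}")
    return problems


class V1LikeStub:
    """Stand-in with <pad> at vocab_size + 0, as reported for v1."""

    def convert_tokens_to_ids(self, token):
        return {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 131_072}[token]


print(check_core_tokens(V1LikeStub()))
# ['<pad>: id 131072 is outside the 131072-entry vocab']
```

Passing a tokenizer loaded with `AutoTokenizer.from_pretrained` in place of the stub should return an empty list if the IDs match the table.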

204 user-defined symbols reserved, same set as 256k-v2:

| Category | Count | Highlights | v1? |
|---|---|---|---|
| Whitespace family | 45 | 2..32-space indents, 1..8 tabs, multi-newlines | ❌ new |
| StarCoder code markers | 16 | <filename>, <reponame>, <file_sep>, <jupyter_*>, <commit_*> | ❌ new |
| FIM (code completion) | 4 | <fim_prefix/middle/suffix/pad> | 3 (no <fim_pad>) |
| ChatML chat | 2 | <\|im_start\|>, <\|im_end\|> | |
| Gemma chat | 2 | <start_of_turn>, <end_of_turn> | ✅ |
| Tool use | 2 | <tool_call>, </tool_call> | ✅ |
| Reasoning | 2 | <think>, </think> (DeepSeek-R1 / Qwen3) | ❌ new |
| Multimodal | 3 | <start_of_image>, <end_of_image>, <image_soft_token> | ✅ |
| Reserved unused | 128 | <unused_0> … <unused_127> | 100 in v1 |

Why the whitespace tokens matter for code

An 8-space indent in a Python file tokenizes as 1 token in v2 (the 8-space run is a reserved whitespace token) but as 4-8 tokens in v1 and most other tokenizers. For a code-heavy training run this is a meaningful efficiency gain; see the 256k-v2 card for the example and full breakdown.
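The effect can be illustrated with a toy model. The sketch below uses greedy longest-match segmentation over a tiny hand-made piece inventory; it is not how SentencePiece BPE segments text, and both inventories are hypothetical. It only shows why a reserved 8-space piece collapses an indent that otherwise needs several tokens.

```python
def greedy_segment(text: str, pieces: set) -> list:
    """Greedy longest-match segmentation over a fixed piece inventory."""
    max_len = max(len(p) for p in pieces)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in pieces:
                out.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no piece covers position {i}")
    return out


indent = " " * 8
without_runs = {" ", " " * 2}          # hypothetical: only 1- and 2-space pieces
with_runs = without_runs | {" " * 8}   # same, plus a reserved 8-space piece

print(len(greedy_segment(indent, without_runs)))  # 4
print(len(greedy_segment(indent, with_runs)))     # 1
```

For a file with thousands of indented lines, saving three or more tokens per indent adds up quickly.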

Usage

```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-128k-v2", use_fast=True)
```

Files

  • tokenizer.model: SentencePiece BPE model (2.2 MB)
  • tokenizer.vocab: vocabulary listing
  • special_tokens_map.json: HF special tokens map
  • tokenizer_config.json: HF tokenizer config

Citation

Built for the OpenEuroLLM project (Horizon Europe). Source repo: https://github.com/OpenEuroLLM/tokenizer.
