---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- 'no'
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
- code
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- code
- quartz
- aenea
- coding
- python
- flores
pipeline_tag: text-generation
---
# QT_V.2 Code 114K — Multilingual Coding Tokenizer
Lowest total token count of any tokenizer at any vocabulary size on our 66-test field benchmark. A 114,688-token vocabulary optimised for multilingual coding models, trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. Beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using 10–37% less vocabulary. Validated on FLORES-200 across 204 languages.
Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.
## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 13,007,924 | 12,961,617 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 43.3× | 31.6× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.03 | 3.94 | 4.18 | 5.72 | 5.34 | 4.91 |
QT Code 114K uses 22.4% fewer tokens than Llama 3 and 9.8% fewer than Tekken across all 204 FLORES languages — with 10–37% less vocabulary.
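The savings percentages follow directly from the total-token column in the table above; a quick arithmetic check:

```python
# Total FLORES-200 token counts, copied from the results table above
totals = {
    "QT Code 114K": 13_007_924,
    "Llama 3": 16_764_198,
    "Tekken": 14_421_539,
    "Qwen 2.5": 15_425_680,
}

def savings_vs(baseline: str, candidate: str = "QT Code 114K") -> float:
    """Percentage of tokens saved by `candidate` relative to `baseline`."""
    return 100 * (totals[baseline] - totals[candidate]) / totals[baseline]

print(f"vs Llama 3: {savings_vs('Llama 3'):.1f}% fewer tokens")  # 22.4
print(f"vs Tekken:  {savings_vs('Tekken'):.1f}% fewer tokens")   # 9.8
```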
### Key FLORES Languages (tok/word)
| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Japanese | 32.1 | 38.9 | 41.3 | 35.8 |
| Tibetan | 46.5 | 149.8 | 168.4 | 98.0 |
| Sinhala | 3.58 | 11.37 | 16.60 | 9.17 |
| Amharic | 3.40 | 11.95 | 11.98 | 6.45 |
| Georgian | 3.46 | 15.47 | 3.93 | 8.33 |
| Odia | 4.10 | 16.90 | 18.30 | 13.65 |
## Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| Total tokens | 3,314 (lowest of any tokenizer) |
| vs Llama 3 (128K) | 41.2% fewer tokens |
| vs Tekken (131K) | 23.8% fewer tokens |
| vs Qwen 2.5 (152K) | 36.1% fewer tokens |
### Code Performance
| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|---|---|
| Python | 110 | 115 | 125 | 97 | 112 | 105 |
| JavaScript | 67 | 71 | 71 | 65 | 69 | 64 |
| Rust | 111 | 113 | 117 | 108 | 111 | 107 |
Python compression improved from 125 (64K) to 115 (96K) to 110 (Code 114K) — closing the gap versus Llama 3's 97 from 28.9% to 13.4%.
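The gap figures can be reproduced from the Python row of the table above (token counts, lower is better):

```python
# Python field-benchmark token counts from the Code Performance table
llama3 = 97
qt_64k, qt_96k, qt_code = 125, 115, 110

def gap_vs_llama3(qt_tokens: int) -> float:
    """Relative overhead versus Llama 3's Python token count, in percent."""
    return 100 * (qt_tokens - llama3) / llama3

print(f"64K gap:       {gap_vs_llama3(qt_64k):.1f}%")   # 28.9
print(f"Code 114K gap: {gap_vs_llama3(qt_code):.1f}%")  # 13.4
```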
### Category Totals (lower is better)
| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Natural Languages (20) | 1,033 | 1,599 | 1,038 | 1,535 |
| V1 Expansion (14) | 662 | 1,758 | 1,092 | 1,509 |
| V2 New Scripts (3) | 188 | 692 | 740 | 523 |
| Celtic / Brythonic (8) | 312 | 391 | 341 | 384 |
| Code (3) | 288 | 270 | 292 | 276 |
| TOTAL (66 tests) | 3,314 | 5,639 | 4,347 | 5,183 |
## When to Use This Variant
QT_V.2 Code 114K is designed for multilingual coding assistants and code generation models. It wins Natural Languages outright (1,033 — beating Tekken's 1,038) while offering competitive code compression. Ideal for models that must serve both code and diverse natural language users.
Also available: QT_V.2 64K (smallest embedding) · QT_V.2 96K (best all-round)
## Usage
```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
encoded = tok.encode(
    "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
)
print(encoded.tokens)
```
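Mean fertility in the FLORES table above is tokens per word. A minimal sketch of the metric, assuming `encode` is any callable returning a token list (e.g. `lambda s: tok.encode(s).tokens` with the tokenizer loaded as above); the exact FLORES evaluation protocol may differ:

```python
def mean_fertility(sentences, encode):
    """Average tokens per whitespace-separated word over a corpus.

    `encode` maps a sentence to a list of tokens.
    """
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Toy check with a character-level "tokenizer":
# "abcd ef" -> 6 non-space tokens / 2 words = 3.0
print(mean_fertility(["abcd ef"], lambda s: [c for c in s if c != " "]))
```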
## Specifications
| Spec | Value |
|---|---|
| Vocabulary | 114,688 |
| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.24 chars |
| Compression | 3.60 chars/token |
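Single-digit splitting means the pretokenizer never lets multi-digit numbers fuse into one token, which keeps arithmetic tokenization uniform. An illustrative regex sketch of the idea (not the actual Llama 3 pretokenizer pattern):

```python
import re

def split_digits(text: str) -> list[str]:
    """Split each digit into its own piece; keep non-digit runs intact."""
    return re.findall(r"\d|\D+", text)

print(split_digits("x = 2026"))  # ['x = ', '2', '0', '2', '6']
```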
## Training
Byte-level BPE with Llama 3 regex pretokenizer. Code-heavy corpus:
| Category | Share | Sources |
|---|---|---|
| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |
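The byte-level BPE procedure named above can be sketched in miniature: repeatedly count adjacent symbol pairs over the byte stream and merge the most frequent one into a new symbol. A toy illustration, not the actual training code:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Most common adjacent pair in the sequence (the next merge candidate)."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"low lower lowest")       # byte-level: start from raw bytes
pair = most_frequent_pair(ids)        # ('l', 'o') as bytes: (108, 111)
ids = merge(ids, pair, 256)           # new symbols start beyond the byte range
print(pair, ids[:4])
```

Real training repeats this loop until the target vocabulary size (here 114,688) is reached, after applying the regex pretokenizer so merges never cross pretoken boundaries.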
## Files
`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`
## Contact
- Open-source: quartzopensource@gmail.com
- Commercial licensing & enterprise: commercial@aeneaglobal.com
## License
Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd
## Citation

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```