QT_V.2 96K — Best All-Round Multilingual Tokenizer
Fewest total tokens on FLORES-200 of any tokenizer tested. A 96,000-token vocabulary covering 71 languages and 26 script families. The most equitable tokenizer in the field, roughly 4× fairer than both Llama 3 and Tekken, while using 25–37% less vocabulary than every competitor.
Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.
FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT 96K | QT Code 114K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 12,961,617 ✓ | 13,007,924 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 31.6× ✓ | 43.3× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 3.94 ✓ | 4.03 | 4.18 | 5.72 | 5.34 | 4.91 |
| Worst language | lao (43.0) | lao (58.0) | lao (58.0) | bod (149.8) | bod (168.4) | bod (98.0) |
QT 96K wins on total tokens, equity, and mean fertility. The 31.6× equity ratio means the worst-served language costs 31.6× as many tokens as the best-served, compared to 118.6× for Llama 3 and 127.9× for Tekken. Llama 3's worst language (Tibetan at 149.8 tok/word) is 3.6× as expensive as QT 96K's Tibetan (41.1 tok/word).
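The equity ratio above is simply the spread between the best- and worst-served languages' fertility (tokens per word). A minimal sketch of the computation, using the lao and bod values quoted above; the other per-language fertilities here are hypothetical placeholders, not published figures:

```python
# Per-language fertility in tokens per word. Only "bod" (41.1) and
# "lao" (43.0) come from the text above; the rest are illustrative.
fertility = {
    "eng": 1.36,   # assumed best-served language (hypothetical value)
    "deu": 1.80,   # hypothetical
    "hin": 3.20,   # hypothetical
    "bod": 41.1,   # Tibetan, from the text above
    "lao": 43.0,   # worst-served, from the table above
}

# Equity ratio: worst-case fertility divided by best-case fertility.
equity_ratio = max(fertility.values()) / min(fertility.values())

# Mean fertility: simple average across languages.
mean_fertility = sum(fertility.values()) / len(fertility)

print(f"equity ratio: {equity_ratio:.1f}x")   # 43.0 / 1.36 ≈ 31.6x
print(f"mean fertility: {mean_fertility:.2f}")
```

A lower ratio means token costs, and therefore inference prices and effective context lengths, are more uniform across languages.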
Script Family Averages (FLORES-200 tok/word)
| Script Family | QT 96K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Latin (37 langs) | 2.20 | 2.39 | 2.20 | 2.41 |
| Cyrillic (5) | 2.23 | 2.59 | 2.27 | 2.99 |
| CJK (4) | 17.17 | 19.75 | 21.36 | 17.26 |
| Indic Other (9) | 4.21 | 12.42 | 6.77 | 10.37 |
| SE Asian (4) | 20.70 | 31.08 | 38.22 | 24.04 |
| Unique Scripts (6) | 9.35 | 32.96 | 32.05 | 21.39 |
QT 96K is roughly 3× more efficient than Llama 3 on Indic languages, and 3.5× more efficient on unique scripts (Georgian, Armenian, Tibetan, Amharic, Hebrew, Greek).
Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| Total tokens | 3,339 |
| vs Llama 3 (128K) | 40.8% fewer tokens |
| vs Tekken (131K) | 23.2% fewer tokens |
| vs Qwen 2.5 (152K) | 35.6% fewer tokens |
Wins 6 of 9 benchmark categories: V1 Expansion, V2 New Scripts, V2 Gap-closers, V2 Latin Wikis, Celtic/Brythonic, and Natural Languages (within 1% of Tekken).
When to Use This Variant
QT_V.2 96K is the recommended general-purpose tokenizer: it offers the best balance of vocabulary size and token compression across all tested language families. Use it for production multilingual models serving diverse user populations.
Also available: QT_V.2 64K (smallest embedding) · QT_V.2 Code 114K (multilingual coding)
Usage
```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```
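To reproduce the fertility numbers reported above for your own text, count tokens per whitespace-separated word. The helper below is a sketch; it is shown with a trivial stand-in encoder so the snippet runs without the actual `tokenizer.json` file (in practice you would pass `lambda s: tok.encode(s).tokens`):

```python
def fertility(encode, text: str) -> float:
    """Tokens per whitespace-separated word for a given encode function."""
    words = text.split()
    return len(encode(text)) / max(len(words), 1)

# Stand-in encoder: one "token" per 3 characters. Replace with the real
# tokenizer, e.g. lambda s: tok.encode(s).tokens
def stub_encode(text: str) -> list[str]:
    return [text[i:i + 3] for i in range(0, len(text), 3)]

print(fertility(stub_encode, "The quick brown fox"))  # → 1.75
```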
Specifications
| Spec | Value |
|---|---|
| Vocabulary | 96,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.1 chars |
| Compression | 3.17 chars/token |
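The "single-digit splitting" row means numbers are pretokenized one digit at a time, so long numerals never collapse into opaque multi-digit tokens (which helps arithmetic). The exact QT_V.2 regex is not published here; the snippet below is an illustrative sketch of the behavior:

```python
import re

def split_digits(text: str) -> list[str]:
    """Illustrative pretokenization pass: each digit becomes its own
    pretoken, while runs of non-digit characters stay together."""
    return re.findall(r"\d|[^\d]+", text)

print(split_digits("Order 12345 shipped"))
# → ['Order ', '1', '2', '3', '4', '5', ' shipped']
```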
Training
Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 57.1% Wikipedia (71 languages via wiki_ultra_clean v7.3), 21.0% code (14 languages, boosted +25%), 21.9% Stack Exchange (49 sites). Top-10 European languages boosted +10%, Hindi/Bengali +15%.
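How the quoted boosts translate into sampling weights is not specified; one plausible reading is a per-language multiplier applied before renormalizing, sketched below. The base weights are hypothetical; only the +10% / +15% boost factors come from the text above:

```python
# Hypothetical uniform base weights per language (placeholder values).
base = {"eng": 1.0, "deu": 1.0, "hin": 1.0, "ben": 1.0, "lao": 1.0}

# Boost factors from the training description: top European languages
# +10%, Hindi/Bengali +15%; unlisted languages are unboosted.
boost = {"eng": 1.10, "deu": 1.10, "hin": 1.15, "ben": 1.15}

boosted = {lang: w * boost.get(lang, 1.0) for lang, w in base.items()}
total = sum(boosted.values())
weights = {lang: w / total for lang, w in boosted.items()}  # sums to 1.0

print(weights)
```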
Files
tokenizer.json · vocab.json · merges.txt · training_report.json
Contact
Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com
License
Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd
Citation
```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```