QT_V.2 96K — Best All-Round Multilingual Tokenizer

Fewest total tokens on FLORES-200 of any tokenizer tested. A 96,000-token vocabulary covering 71 languages and 26 script families. The most equitable tokenizer in the field: roughly 4× fairer than both Llama 3 and Tekken, while using 25–37% less vocabulary than every competitor.

Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.

FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT 96K | QT Code 114K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 12,961,617 | 13,007,924 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 31.6× | 43.3× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 3.94 | 4.03 | 4.18 | 5.72 | 5.34 | 4.91 |
| Worst language | lao (43.0) | lao (58.0) | lao (58.0) | bod (149.8) | bod (168.4) | bod (98.0) |

QT 96K wins on total tokens, equity, and mean fertility. The 31.6× equity ratio means the worst-served language costs 31.6× more tokens per word than the best-served, compared to 118.6× for Llama 3 and 127.9× for Tekken. Llama 3's worst language (Tibetan at 149.8 tok/word) is 3.6× more expensive than Tibetan under QT 96K (41.1 tok/word).
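The equity ratio follows directly from per-language fertility (tokens per word on the parallel FLORES-200 sentences): worst-served fertility divided by best-served fertility. A minimal sketch with illustrative values; the worst-served figure is Lao from the table above, and the best-served value is back-computed from the reported 31.6× ratio, not taken from the full 204-language results:

```python
# Equity ratio = worst-served fertility / best-served fertility.
# Fertility = tokens per word on the parallel FLORES-200 sentences.
fertility = {
    "lao": 43.0,   # worst-served language for QT 96K (table above)
    "bod": 41.1,   # Tibetan under QT 96K (prose above)
    "eng": 1.36,   # assumed best-served value, derived as 43.0 / 31.6
}

best = min(fertility.values())
worst = max(fertility.values())
equity_ratio = worst / best
print(f"equity ratio: {equity_ratio:.1f}x")  # ~31.6x
```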

Script Family Averages (FLORES-200 tok/word)

| Script Family | QT 96K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Latin (37 langs) | 2.20 | 2.39 | 2.20 | 2.41 |
| Cyrillic (5) | 2.23 | 2.59 | 2.27 | 2.99 |
| CJK (4) | 17.17 | 19.75 | 21.36 | 17.26 |
| Indic Other (9) | 4.21 | 12.42 | 6.77 | 10.37 |
| SE Asian (4) | 20.70 | 31.08 | 38.22 | 24.04 |
| Unique Scripts (6) | 9.35 | 32.96 | 32.05 | 21.39 |

QT 96K is roughly 3× more efficient than Llama 3 on Indic languages, and about 3.5× more efficient on unique scripts (Georgian, Armenian, Tibetan, Amharic, Hebrew, Greek).

Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| Total tokens | 3,339 |
| vs Llama 3 (128K) | 40.8% fewer tokens |
| vs Tekken (131K) | 23.2% fewer tokens |
| vs Qwen 2.5 (152K) | 35.6% fewer tokens |

Wins 6 of 9 benchmark categories: V1 Expansion, V2 New Scripts, V2 Gap-closers, V2 Latin Wikis, Celtic/Brythonic, and Natural Languages (within 1% of Tekken).

When to Use This Variant

QT_V.2 96K is the recommended general-purpose variant: it offers the best balance of vocabulary size and token compression across all language families, and is suited to production multilingual models serving diverse user populations.

Also available: QT_V.2 64K (smallest embedding) · QT_V.2 Code 114K (multilingual coding)

Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer shipped with this repository.
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```
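The fertility figures reported above can be reproduced with any encode callable. A hedged sketch: `fertility` is a helper introduced here (not part of the tokenizers API), words are approximated by whitespace splitting, and the toy encoder only stands in so the example runs without `tokenizer.json`:

```python
def fertility(encode, sentences):
    """Mean tokens per whitespace-delimited word across sentences.

    `encode` is any callable mapping a string to a list of tokens;
    plug in `lambda s: tok.encode(s).tokens` for a loaded tokenizer.
    """
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Stand-in encoder: splits each word into 2-character chunks.
def toy_encode(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

print(fertility(toy_encode, ["the quick brown fox"]))  # 10 tokens / 4 words
```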

Specifications

| Spec | Value |
|---|---|
| Vocabulary | 96,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.1 chars |
| Compression | 3.17 chars/token |
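The single-digit arithmetic splitting noted in the specs can be illustrated with a simplified pretokenizer pattern. This is a sketch of the idea only, not the actual Llama 3 regex used by this tokenizer:

```python
import re

# Simplified pretokenization: each digit becomes its own piece, so a
# number like "2048" can never merge into a single token; letter runs
# and punctuation runs group as usual.
PRETOKEN = re.compile(r"\d|[A-Za-z]+|[^\sA-Za-z\d]+")

def pretokenize(text):
    return PRETOKEN.findall(text)

print(pretokenize("GPT-4 has 2048 dims"))
# ['GPT', '-', '4', 'has', '2', '0', '4', '8', 'dims']
```

Splitting digits individually keeps arithmetic token counts proportional to number length, instead of depending on which multi-digit chunks happen to exist in the vocabulary.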

Training

Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 57.1% Wikipedia (71 languages via wiki_ultra_clean v7.3), 21.0% code (14 languages, boosted +25%), 21.9% Stack Exchange (49 sites). Top-10 European languages boosted +10%, Hindi/Bengali +15%.
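The boost percentages describe upweighting sources before sampling. A hedged sketch of how such weights could be combined; the base shares, the choice of which figures are pre- versus post-boost, and the renormalization step are all assumptions for illustration, not the published training recipe:

```python
# Illustrative only: apply per-source boost factors to assumed base
# corpus shares, then renormalize so sampling weights sum to 1.
base = {"wikipedia": 0.571, "stackexchange": 0.219, "code": 0.168}
boost = {"code": 1.25}  # the "+25%" code boost from this card

raw = {src: share * boost.get(src, 1.0) for src, share in base.items()}
total = sum(raw.values())
weights = {src: w / total for src, w in raw.items()}
print(weights)  # boosted code share ends up near the reported 21%
```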

Files

tokenizer.json · vocab.json · merges.txt · training_report.json

Contact

Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com

License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host}
}
```