QT_V.2 64K — Multilingual BPE Tokenizer

The most equitable 64K tokenizer available: 71 natural languages across 26 script families, with half the vocabulary of Llama 3, Tekken, and Qwen 2.5, yet fewer total tokens than any of them on both FLORES-200 (204 languages) and our 66-test field benchmark.

Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.

FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT 64K | QT 96K | QT Code 114K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 13,592,357 | 12,961,617 | 13,007,924 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 41.0× | 31.6× | 43.3× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.18 | 3.94 | 4.03 | 5.72 | 5.34 | 4.91 |

The equity ratio measures the gap in fertility between the best-served and worst-served language (lower is fairer). At 41.0×, QT 64K is 2.9× more equitable than Llama 3 (118.6×) and 3.1× more equitable than Tekken (127.9×), with half their vocabulary size.
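A minimal sketch of how these two metrics can be computed from per-language fertility scores (the numbers below are illustrative placeholders, not the benchmark data):

```python
# Sketch: equity ratio and mean fertility from per-language fertility
# (tokens per word). Sample values are illustrative, not FLORES-200 results.
def equity_ratio(fertility_by_lang: dict[str, float]) -> float:
    """Ratio of worst-served to best-served language (lower is fairer)."""
    return max(fertility_by_lang.values()) / min(fertility_by_lang.values())

def mean_fertility(fertility_by_lang: dict[str, float]) -> float:
    """Unweighted mean fertility across languages."""
    return sum(fertility_by_lang.values()) / len(fertility_by_lang)

sample = {"english": 1.2, "georgian": 3.8, "tibetan": 42.5}
print(round(equity_ratio(sample), 2))    # 35.42
print(round(mean_fertility(sample), 2))  # 15.83
```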

Where QT 64K Dominates (FLORES-200 tok/word)

| Language | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Tibetan | 42.5 | 149.8 | 168.4 | 98.0 |
| Odia | 4.16 | 16.90 | 18.30 | 13.65 |
| Khmer | 17.1 | 40.9 | 70.5 | 30.7 |
| Georgian | 3.83 | 15.47 | 3.93 | 8.33 |
| Sinhala | 3.84 | 11.37 | 16.60 | 9.17 |
| Amharic | 3.90 | 11.95 | 11.98 | 6.45 |

Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| Total tokens (QT 64K) | 3,593 |
| vs Llama 3 (128K) | 36.3% fewer tokens |
| vs Tekken (131K) | 17.3% fewer tokens |
| vs Qwen 2.5 (152K) | 30.7% fewer tokens |

When to Use This Variant

QT_V.2 64K is ideal when you need the smallest possible embedding table — for parameter-constrained small models, edge deployment, or when every MB of VRAM matters.
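As a rough illustration of the embedding-table savings (assuming fp16 weights and a hypothetical hidden size of 4096; untied input/output embeddings would double these figures):

```python
# Sketch: embedding-table memory for different vocab sizes, assuming
# fp16 (2 bytes/param) and a hypothetical hidden size of 4096.
def embedding_mib(vocab_size: int, hidden: int = 4096, bytes_per_param: int = 2) -> float:
    """Embedding table size in MiB."""
    return vocab_size * hidden * bytes_per_param / 2**20

print(embedding_mib(64_000))   # 500.0 MiB  (QT_V.2 64K)
print(embedding_mib(128_256))  # 1002.0 MiB (Llama 3's 128K vocab)
```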

Also available: QT_V.2 96K (best all-round) · QT_V.2 Code 114K (multilingual coding)

Usage

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```
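To reproduce a fertility figure like those in the tables above, divide the token count by the word count. A minimal helper (using a simple whitespace word count, which is only an approximation for languages without spaces):

```python
# Sketch: fertility = tokens per whitespace-separated word.
# Pair with len(encoded.ids) from the tokenizer loaded above.
def fertility(num_tokens: int, text: str) -> float:
    """Tokens per whitespace-separated word (approximate)."""
    return num_tokens / len(text.split())

sample = "The quick brown fox jumps over the lazy dog"
# e.g. if the sentence encodes to 9 tokens:
print(fertility(9, sample))  # 1.0 (9 tokens / 9 words)
```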

Specifications

| Spec | Value |
|---|---|
| Vocabulary | 64,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 5.7 chars |
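The single-digit arithmetic splitting can be illustrated with a simplified pretokenization pass (a sketch of the idea, not the actual Llama 3 pretokenizer regex):

```python
import re

# Sketch: single-digit splitting at the pretokenizer level, so a number
# like "2026" can never be merged into one token. The pattern here is a
# simplified stand-in, not the full Llama 3 pretokenizer regex.
def pretokenize(text: str) -> list[str]:
    # Each digit becomes its own pre-token; letter runs stay whole.
    return re.findall(r"\d|[A-Za-z]+|\S", text)

print(pretokenize("year 2026"))  # ['year', '2', '0', '2', '6']
```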

Training

Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 58.5% Wikipedia (71 languages via wiki_ultra_clean v7.3), 18.0% code (14 languages), 23.5% Stack Exchange (49 sites via se_ultra_clean v1).
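The three corpus shares sum to 100%. A sketch of sampling training documents according to those weights (source names are taken from the mix above; the sampling loop is illustrative):

```python
import random

# Sketch: drawing training sources according to the stated corpus mix.
weights = {"wikipedia": 0.585, "code": 0.180, "stackexchange": 0.235}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # mix covers the full corpus

def sample_source(rng: random.Random) -> str:
    """Pick a corpus source with probability proportional to its weight."""
    return rng.choices(list(weights), weights=list(weights.values()))[0]

rng = random.Random(0)
counts = {k: 0 for k in weights}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Empirical shares approach the configured mix as the number of draws grows.
```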

Files

tokenizer.json · vocab.json · merges.txt · training_report.json

Contact

Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com

License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

Citation

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```