QT_V.2 64K — Multilingual BPE Tokenizer

The most equitable 64K tokenizer available: 71 natural languages across 26 script families, with half the vocabulary of Llama 3, Tekken, and Qwen 2.5, yet fewer total tokens than any of them on both FLORES-200 (204 languages) and our 66-test field benchmark.

Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.

FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT 64K | QT 96K | QT Code 114K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 13,592,357 | 12,961,617 | 13,007,924 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 41.0× | 31.6× | 43.3× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.18 | 3.94 | 4.03 | 5.72 | 5.34 | 4.91 |

The equity ratio measures the gap in fertility between the best-served and worst-served language (lower is fairer). At 41.0×, QT 64K is 2.9× more equitable than Llama 3 (118.6×) and 3.1× more equitable than Tekken (127.9×), with half their vocabulary size.
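A minimal sketch of how these two metrics can be computed from per-language fertility scores (the numbers below are illustrative placeholders, not the benchmark data):

```python
# Sketch: equity ratio and mean fertility from per-language fertility
# (tokens per word). Sample values are illustrative, not FLORES-200 results.
def equity_ratio(fertility_by_lang: dict[str, float]) -> float:
    """Ratio of worst-served to best-served language (lower is fairer)."""
    return max(fertility_by_lang.values()) / min(fertility_by_lang.values())

def mean_fertility(fertility_by_lang: dict[str, float]) -> float:
    """Unweighted mean fertility across languages."""
    return sum(fertility_by_lang.values()) / len(fertility_by_lang)

sample = {"english": 1.2, "georgian": 3.8, "tibetan": 42.5}
print(round(equity_ratio(sample), 2))    # 35.42
print(round(mean_fertility(sample), 2))  # 15.83
```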

Where QT 64K Dominates (FLORES-200 tok/word)

| Language | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Tibetan | 42.5 | 149.8 | 168.4 | 98.0 |
| Odia | 4.16 | 16.90 | 18.30 | 13.65 |
| Khmer | 17.1 | 40.9 | 70.5 | 30.7 |
| Georgian | 3.83 | 15.47 | 3.93 | 8.33 |
| Sinhala | 3.84 | 11.37 | 16.60 | 9.17 |
| Amharic | 3.90 | 11.95 | 11.98 | 6.45 |

Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| Total tokens (QT 64K) | 3,593 |
| vs Llama 3 (128K) | 36.3% fewer tokens |
| vs Tekken (131K) | 17.3% fewer tokens |
| vs Qwen 2.5 (152K) | 30.7% fewer tokens |

When to Use This Variant

QT_V.2 64K is ideal when you need the smallest possible embedding table — for parameter-constrained small models, edge deployment, or when every MB of VRAM matters.
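As a rough illustration of the embedding-table savings (assuming fp16 weights and a hypothetical hidden size of 4096; untied input/output embeddings would double these figures):

```python
# Sketch: embedding-table memory for different vocab sizes, assuming
# fp16 (2 bytes/param) and a hypothetical hidden size of 4096.
def embedding_mib(vocab_size: int, hidden: int = 4096, bytes_per_param: int = 2) -> float:
    """Embedding table size in MiB."""
    return vocab_size * hidden * bytes_per_param / 2**20

print(embedding_mib(64_000))   # 500.0 MiB  (QT_V.2 64K)
print(embedding_mib(128_256))  # 1002.0 MiB (Llama 3's 128K vocab)
```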

Also available: QT_V.2 96K (best all-round) · QT_V.2 Code 114K (multilingual coding)

Usage

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```
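To reproduce a fertility figure like those in the tables above, divide the token count by the word count. A minimal helper (using a simple whitespace word count, which is only an approximation for languages without spaces):

```python
# Sketch: fertility = tokens per whitespace-separated word.
# Pair with len(encoded.ids) from the tokenizer loaded above.
def fertility(num_tokens: int, text: str) -> float:
    """Tokens per whitespace-separated word (approximate)."""
    return num_tokens / len(text.split())

sample = "The quick brown fox jumps over the lazy dog"
# e.g. if the sentence encodes to 9 tokens:
print(fertility(9, sample))  # 1.0 (9 tokens / 9 words)
```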

Specifications

| Spec | Value |
|---|---|
| Vocabulary | 64,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 5.7 chars |
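The single-digit arithmetic splitting can be illustrated with a simplified pretokenization pass (a sketch of the idea, not the actual Llama 3 pretokenizer regex):

```python
import re

# Sketch: single-digit splitting at the pretokenizer level, so a number
# like "2026" can never be merged into one token. The pattern here is a
# simplified stand-in, not the full Llama 3 pretokenizer regex.
def pretokenize(text: str) -> list[str]:
    # Each digit becomes its own pre-token; letter runs stay whole.
    return re.findall(r"\d|[A-Za-z]+|\S", text)

print(pretokenize("year 2026"))  # ['year', '2', '0', '2', '6']
```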

Training

Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 58.5% Wikipedia (71 languages via wiki_ultra_clean v7.3), 18.0% code (14 languages), 23.5% Stack Exchange (49 sites via se_ultra_clean v1).
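The three corpus shares sum to 100%. A sketch of sampling training documents according to those weights (source names are taken from the mix above; the sampling loop is illustrative):

```python
import random

# Sketch: drawing training sources according to the stated corpus mix.
weights = {"wikipedia": 0.585, "code": 0.180, "stackexchange": 0.235}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # mix covers the full corpus

def sample_source(rng: random.Random) -> str:
    """Pick a corpus source with probability proportional to its weight."""
    return rng.choices(list(weights), weights=list(weights.values()))[0]

rng = random.Random(0)
counts = {k: 0 for k in weights}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Empirical shares approach the configured mix as the number of draws grows.
```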

Files

tokenizer.json · vocab.json · merges.txt · training_report.json

Contact

Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com

License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

Citation

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```