QT_V.2 64K — Multilingual BPE Tokenizer
The most equitable 64K tokenizer available. It covers 71 natural languages across 26 script families with roughly half the vocabulary of Llama 3 (128K) and Tekken (131K) and well under half that of Qwen 2.5 (152K), yet produces fewer total tokens than all three on both FLORES-200 (204 languages) and our 66-test field benchmark.
Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.
FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT 64K | QT 96K | QT Code 114K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 13,592,357 | 12,961,617 | 13,007,924 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 41.0× | 31.6× | 43.3× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.18 | 3.94 | 4.03 | 5.72 | 5.34 | 4.91 |
The equity ratio measures the gap between the best-served and worst-served languages (lower is fairer). At 41.0×, QT 64K's ratio is 2.9× lower than Llama 3's (118.6×) and 3.1× lower than Tekken's (127.9×), at roughly half the vocabulary.
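The relative figures quoted above are simple quotients of the equity ratios in the table. The sketch below reproduces them. It also shows one plausible way to compute an equity ratio from per-language tokens per word, assuming it is the worst-served value divided by the best-served value; the exact formula is not given here, so `equity_ratio` and `relative_equity` are illustrative names rather than the benchmark's own code.

```python
# Equity ratios copied from the FLORES-200 table above.
EQUITY = {
    "QT 64K": 41.0,
    "Llama 3": 118.6,
    "Tekken": 127.9,
    "Qwen 2.5": 77.7,
}


def relative_equity(baseline: str, other: str) -> float:
    """How many times larger `other`'s equity ratio is than `baseline`'s."""
    return EQUITY[other] / EQUITY[baseline]


def equity_ratio(tokens_per_word: dict[str, float]) -> float:
    """ASSUMED definition: worst-served / best-served tokens per word."""
    return max(tokens_per_word.values()) / min(tokens_per_word.values())


for name in ("Llama 3", "Tekken", "Qwen 2.5"):
    print(f"{name}: {relative_equity('QT 64K', name):.1f}x")
# Llama 3: 2.9x
# Tekken: 3.1x
# Qwen 2.5: 1.9x
```

Under the assumed definition, a tokenizer that spends 1.5 tokens per word on its best-served language and 6.0 on its worst-served would score `equity_ratio({"lang_a": 1.5, "lang_b": 6.0}) == 4.0`; these two values are made up purely for illustration.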
Where QT 64K Dominates (FLORES-200 tok/word)
| Language | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Tibetan | 42.5 | 149.8 | 168.4 | 98.0 |
| Odia | 4.16 | 16.90 | 18.30 | 13.65 |
| Khmer | 17.1 | 40.9 | 70.5 | 30.7 |
| Georgian | 3.83 | 15.47 | 3.93 | 8.33 |
| Sinhala | 3.84 | 11.37 | 16.60 | 9.17 |
| Amharic | 3.90 | 11.95 | 11.98 | 6.45 |
Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| Total tokens | 3,593 |
| vs Llama 3 (128K) | 36.3% fewer tokens |
| vs Tekken (131K) | 17.3% fewer tokens |
| vs Qwen 2.5 (152K) | 30.7% fewer tokens |
When to Use This Variant
QT_V.2 64K is the smallest variant in the family. Choose it when embedding-table size is the binding constraint: parameter-constrained small models, edge deployment, or any setup where every MB of VRAM matters.
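To make the embedding-table saving concrete, here is a back-of-the-envelope sketch. The hidden size of 2048 and the 2-byte (fp16/bf16) parameters are assumptions chosen for illustration, not properties of any particular model, and the 128K comparison uses the nominal round figure of 128,000 entries.

```python
def embedding_table_mib(vocab_size: int, hidden_size: int, bytes_per_param: int = 2) -> float:
    """Size of one vocab_size x hidden_size embedding matrix in MiB."""
    return vocab_size * hidden_size * bytes_per_param / 2**20


HIDDEN_SIZE = 2048  # hypothetical model width, for illustration only

for name, vocab in [("QT 64K", 64_000), ("128K vocabulary", 128_000)]:
    print(f"{name}: {embedding_table_mib(vocab, HIDDEN_SIZE):.0f} MiB")
# QT 64K: 250 MiB
# 128K vocabulary: 500 MiB
```

If the model does not tie its input embeddings to its output projection, both matrices scale with the vocabulary, so the saving roughly doubles.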
Also available: QT_V.2 96K (best all-round) · QT_V.2 Code 114K (multilingual coding)
Usage
```python
from tokenizers import Tokenizer

# Load the tokenizer definition shipped with this repository.
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```
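To get a rough tokens-per-word figure for your own text, the same `encode` call can be wrapped in a small helper. This is a minimal sketch: `tokens_per_word` is a hypothetical name, and it assumes words are whitespace-delimited, which may differ from the word segmentation used for the FLORES-200 numbers above and does not suit scripts written without spaces.

```python
def tokens_per_word(tok, text: str) -> float:
    """Tokens emitted per whitespace-delimited word (ASSUMED word definition).

    `tok` is any object whose encode(text) result exposes a `.tokens` list,
    such as a `tokenizers.Tokenizer`.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(tok.encode(text).tokens) / len(words)


# Usage with the tokenizer loaded from tokenizer.json:
#   from tokenizers import Tokenizer
#   tok = Tokenizer.from_file("tokenizer.json")
#   print(tokens_per_word(tok, "The quick brown fox jumps over the lazy dog"))
```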
Specifications
| Spec | Value |
|---|---|
| Vocabulary | 64,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 5.7 chars |
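As an illustration of what single-digit splitting means for numbers, the sketch below splits every digit into its own piece before any merges could apply. It uses a simplified stand-in pattern from Python's `re` module, not the tokenizer's actual pretokenizer regex, and `split_digits` is a hypothetical helper.

```python
import re


def split_digits(text: str) -> list[str]:
    """Each digit becomes its own piece; runs of non-digits stay together."""
    return re.findall(r"\d|\D+", text)


print(split_digits("Total: 2025 items"))
# ['Total: ', '2', '0', '2', '5', ' items']
```

Because no piece ever contains more than one digit, a number of any length is represented by one token per digit rather than by multi-digit tokens whose boundaries depend on the number itself.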
Training
Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 58.5% Wikipedia (71 languages via wiki_ultra_clean v7.3), 18.0% code (14 languages), 23.5% Stack Exchange (49 sites via se_ultra_clean v1).
Files
tokenizer.json · vocab.json · merges.txt · training_report.json
Contact
Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com
License
Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd
Citation
```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```