QT_V.2 96K — Best All-Round Multilingual Tokenizer

Fewest total tokens on FLORES-200 of any tokenizer tested. A 96,000-token vocabulary covering 71 languages and 26 script families. The most equitable tokenizer in the field: roughly 4× fairer than both Llama 3 and Tekken, while using 25–37% less vocabulary than every competitor.

Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.

FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT 96K | QT Code 114K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 12,961,617 | 13,007,924 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 31.6× | 43.3× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 3.94 | 4.03 | 4.18 | 5.72 | 5.34 | 4.91 |
| Worst language | lao (43.0) | lao (58.0) | lao (58.0) | bod (149.8) | bod (168.4) | bod (98.0) |

QT 96K wins on total tokens, equity, and mean fertility. The 31.6× equity ratio means the worst-served language costs 31.6× more tokens per word than the best-served, compared to 118.6× for Llama 3 and 127.9× for Tekken. Llama 3's worst language (Tibetan at 149.8 tok/word) is 3.6× more expensive than Tibetan under QT 96K (41.1 tok/word).
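The equity ratio follows directly from per-language fertility (tokens per word on the parallel FLORES-200 sentences): worst-served fertility divided by best-served fertility. A minimal sketch with illustrative values; the worst-served figure is Lao from the table above, and the best-served value is back-computed from the reported 31.6× ratio, not taken from the full 204-language results:

```python
# Equity ratio = worst-served fertility / best-served fertility.
# Fertility = tokens per word on the parallel FLORES-200 sentences.
fertility = {
    "lao": 43.0,   # worst-served language for QT 96K (table above)
    "bod": 41.1,   # Tibetan under QT 96K (prose above)
    "eng": 1.36,   # assumed best-served value, derived as 43.0 / 31.6
}

best = min(fertility.values())
worst = max(fertility.values())
equity_ratio = worst / best
print(f"equity ratio: {equity_ratio:.1f}x")  # ~31.6x
```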

Script Family Averages (FLORES-200 tok/word)

| Script Family | QT 96K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Latin (37 langs) | 2.20 | 2.39 | 2.20 | 2.41 |
| Cyrillic (5) | 2.23 | 2.59 | 2.27 | 2.99 |
| CJK (4) | 17.17 | 19.75 | 21.36 | 17.26 |
| Indic Other (9) | 4.21 | 12.42 | 6.77 | 10.37 |
| SE Asian (4) | 20.70 | 31.08 | 38.22 | 24.04 |
| Unique Scripts (6) | 9.35 | 32.96 | 32.05 | 21.39 |

QT 96K is roughly 3× more efficient than Llama 3 on Indic languages, and about 3.5× more efficient on unique scripts (Georgian, Armenian, Tibetan, Amharic, Hebrew, Greek).

Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| Total tokens | 3,339 |
| vs Llama 3 (128K) | 40.8% fewer tokens |
| vs Tekken (131K) | 23.2% fewer tokens |
| vs Qwen 2.5 (152K) | 35.6% fewer tokens |

Wins 6 of 9 benchmark categories: V1 Expansion, V2 New Scripts, V2 Gap-closers, V2 Latin Wikis, Celtic/Brythonic, and Natural Languages (within 1% of Tekken).

When to Use This Variant

QT_V.2 96K is the recommended general-purpose variant: it offers the best balance of vocabulary size and token compression across all language families, and is suited to production multilingual models serving diverse user populations.

Also available: QT_V.2 64K (smallest embedding) · QT_V.2 Code 114K (multilingual coding)

Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer shipped with this repository.
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```
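The fertility figures reported above can be reproduced with any encode callable. A hedged sketch: `fertility` is a helper introduced here (not part of the tokenizers API), words are approximated by whitespace splitting, and the toy encoder only stands in so the example runs without `tokenizer.json`:

```python
def fertility(encode, sentences):
    """Mean tokens per whitespace-delimited word across sentences.

    `encode` is any callable mapping a string to a list of tokens;
    plug in `lambda s: tok.encode(s).tokens` for a loaded tokenizer.
    """
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Stand-in encoder: splits each word into 2-character chunks.
def toy_encode(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

print(fertility(toy_encode, ["the quick brown fox"]))  # 10 tokens / 4 words
```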

Specifications

| Spec | Value |
|---|---|
| Vocabulary | 96,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.1 chars |
| Compression | 3.17 chars/token |
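The single-digit arithmetic splitting noted in the specs can be illustrated with a simplified pretokenizer pattern. This is a sketch of the idea only, not the actual Llama 3 regex used by this tokenizer:

```python
import re

# Simplified pretokenization: each digit becomes its own piece, so a
# number like "2048" can never merge into a single token; letter runs
# and punctuation runs group as usual.
PRETOKEN = re.compile(r"\d|[A-Za-z]+|[^\sA-Za-z\d]+")

def pretokenize(text):
    return PRETOKEN.findall(text)

print(pretokenize("GPT-4 has 2048 dims"))
# ['GPT', '-', '4', 'has', '2', '0', '4', '8', 'dims']
```

Splitting digits individually keeps arithmetic token counts proportional to number length, instead of depending on which multi-digit chunks happen to exist in the vocabulary.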

Training

Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 57.1% Wikipedia (71 languages via wiki_ultra_clean v7.3), 21.0% code (14 languages, boosted +25%), 21.9% Stack Exchange (49 sites). Top-10 European languages boosted +10%, Hindi/Bengali +15%.
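The boost percentages describe upweighting sources before sampling. A hedged sketch of how such weights could be combined; the base shares, the choice of which figures are pre- versus post-boost, and the renormalization step are all assumptions for illustration, not the published training recipe:

```python
# Illustrative only: apply per-source boost factors to assumed base
# corpus shares, then renormalize so sampling weights sum to 1.
base = {"wikipedia": 0.571, "stackexchange": 0.219, "code": 0.168}
boost = {"code": 1.25}  # the "+25%" code boost from this card

raw = {src: share * boost.get(src, 1.0) for src, share in base.items()}
total = sum(raw.values())
weights = {src: w / total for src, w in raw.items()}
print(weights)  # boosted code share ends up near the reported 21%
```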

Files

tokenizer.json · vocab.json · merges.txt · training_report.json

Contact

Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com

License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host}
}
```