---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- quartz
- aenea
- flores
pipeline_tag: text-generation
---

# QT_V.2 96K: Best All-Round Multilingual Tokenizer

**Fewest total tokens on FLORES-200 of any tokenizer tested.** A 96,000-token vocabulary covering 71 languages and 26 script families. It is also the most equitable tokenizer tested: its equity ratio is roughly 4× lower than that of Llama 3 or Tekken, with a vocabulary 25–37% smaller than any competitor's.

Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).

## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT 96K | QT Code 114K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | **12,961,617** ✓ | 13,007,924 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | **31.6×** ✓ | 43.3× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility (tok/word) | **3.94** ✓ | 4.03 | 4.18 | 5.72 | 5.34 | 4.91 |
| Worst language (tok/word) | lao (43.0) | lao (58.0) | lao (58.0) | bod (149.8) | bod (168.4) | bod (98.0) |

**QT 96K wins on total tokens, equity, and mean fertility.** The 31.6× equity ratio means the worst-served language costs 31.6× as many tokens as the best-served, compared with 118.6× for Llama 3 and 127.9× for Tekken. Llama 3's worst language (Tibetan at 149.8 tok/word) is **3.6× more expensive** than QT 96K's Tibetan (41.1 tok/word).

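As a rough illustration of how figures like mean fertility and the equity ratio could be derived, the sketch below assumes that fertility is tokens per word for each language and that the equity ratio is the highest per-language fertility divided by the lowest. The helper names `fertility` and `equity_ratio` and the sample counts are hypothetical, chosen for illustration only; they are not the benchmark's actual code or FLORES-200 measurements.

```python
from statistics import mean


def fertility(num_tokens: int, num_words: int) -> float:
    """Tokens per word for one language (assumed definition)."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return num_tokens / num_words


def equity_ratio(fertilities: dict[str, float]) -> float:
    """Worst-served fertility divided by best-served fertility (assumed definition)."""
    return max(fertilities.values()) / min(fertilities.values())


# Hypothetical (tokens, words) counts per language, NOT FLORES-200 data.
counts = {
    "eng": (1500, 1000),
    "deu": (2200, 1000),
    "lao": (43000, 1000),
}

# Map each language code to its tokens-per-word value.
per_language = {lang: fertility(t, w) for lang, (t, w) in counts.items()}

print(f"mean fertility: {mean(per_language.values()):.2f}")
print(f"equity ratio:   {equity_ratio(per_language):.1f}x")
```

Under these assumptions, a lower equity ratio means the gap between the best-served and worst-served language is smaller, which is how the 31.6× versus 118.6× comparison above would be read.
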
### Script Family Averages (FLORES-200 tok/word)

| Script Family | QT 96K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Latin (37 langs) | **2.20** | 2.39 | 2.20 | 2.41 |
| Cyrillic (5) | **2.23** | 2.59 | 2.27 | 2.99 |
| CJK (4) | 17.17 | 19.75 | 21.36 | **17.26** |
| Indic Other (9) | **4.21** | 12.42 | 6.77 | 10.37 |
| SE Asian (4) | **20.70** | 31.08 | 38.22 | 24.04 |
| Unique Scripts (6) | **9.35** | 32.96 | 32.05 | 21.39 |

QT 96K is **3× more efficient** than Llama 3 on Indic languages, and **3.4× more efficient** on unique scripts (Georgian, Armenian, Tibetan, Amharic, Hebrew, Greek).

## Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| **Total tokens** | **3,339** |
| vs Llama 3 (128K) | 40.8% fewer tokens |
| vs Tekken (131K) | 23.2% fewer tokens |
| vs Qwen 2.5 (152K) | 35.6% fewer tokens |

Wins 6 of 9 benchmark categories: V1 Expansion, V2 New Scripts, V2 Gap-closers, V2 Latin Wikis, Celtic/Brythonic, and Natural Languages (within 1% of Tekken).

## When to Use This Variant

**QT_V.2 96K** is the recommended general-purpose tokenizer. Best balance between vocab efficiency and token compression across all language families. Recommended for production multilingual models serving diverse user populations.

Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 Code 114K](https://huggingface.co/QuartzOpen/QT_V.2_Code_114K) (multilingual coding)

## Usage

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```

## Specifications

| Spec | Value |
|---|---|
| Vocabulary | 96,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.1 chars |
| Compression | 3.17 chars/token |

## Training

Byte-level BPE with Llama 3 regex pretokenizer.
Corpus: 57.1% Wikipedia (71 languages via wiki_ultra_clean v7.3), 21.0% code (14 languages, boosted +25%), 21.9% Stack Exchange (49 sites). Top-10 European languages boosted +10%, Hindi/Bengali +15%.

## Files

`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

## Contact

Open-source: quartzopensource@gmail.com

Commercial licensing & enterprise: commercial@aeneaglobal.com

## License

Apache 2.0. Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```