---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- quartz
- aenea
- flores
pipeline_tag: text-generation
---

# QT_V.2 64K: Multilingual BPE Tokenizer

**The most equitable 64K tokenizer available.** It covers 71 natural languages across 26 script families with half the vocabulary (or less) of Llama 3, Tekken, and Qwen 2.5, yet produces fewer total tokens on both FLORES-200 (204 languages) and our 66-test field benchmark.

Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).

## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT 64K | QT 96K | QT Code 114K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | 13,592,357 | **12,961,617** | 13,007,924 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | **41.0×** | **31.6×** | 43.3× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.18 | 3.94 | 4.03 | 5.72 | 5.34 | 4.91 |

The equity ratio measures the gap between the best-served and worst-served languages (lower is fairer). QT 64K at 41.0× is **2.9× more equitable than Llama 3** (118.6×) and **3.1× more equitable than Tekken** (127.9×), at half the vocabulary.

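
The comparison factors quoted above can be reproduced from the reported equity ratios. The sketch below is illustrative only: it assumes the equity ratio is the worst-served language's tokens-per-word divided by the best-served language's, which this card does not spell out, and the `toy` values are hypothetical rather than benchmark data.

```python
# Equity ratios as reported in the FLORES-200 table (lower is fairer).
reported = {
    "QT 64K": 41.0,
    "Llama 3": 118.6,
    "Tekken": 127.9,
    "Qwen 2.5": 77.7,
}


def equity_ratio(tokens_per_word):
    """Assumed definition: worst-served fertility divided by best-served fertility."""
    return max(tokens_per_word.values()) / min(tokens_per_word.values())


def times_more_equitable(baseline, other):
    """How many times larger `other`'s equity ratio is than `baseline`'s."""
    return other / baseline


# Hypothetical per-language tokens-per-word values, for illustration only.
toy = {"lang_a": 1.5, "lang_b": 3.0, "lang_c": 6.0}
print(equity_ratio(toy))  # 4.0

for name in ("Llama 3", "Tekken", "Qwen 2.5"):
    factor = times_more_equitable(reported["QT 64K"], reported[name])
    print(f"QT 64K vs {name}: {factor:.1f}x")
```

With the reported figures this gives 2.9x for Llama 3 and 3.1x for Tekken, matching the text.
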
### Where QT 64K Dominates (FLORES-200 tok/word)

| Language | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Tibetan | **42.5** | 149.8 | 168.4 | 98.0 |
| Odia | **4.16** | 16.90 | 18.30 | 13.65 |
| Khmer | **17.1** | 40.9 | 70.5 | 30.7 |
| Georgian | **3.83** | 15.47 | 3.93 | 8.33 |
| Sinhala | **3.84** | 11.37 | 16.60 | 9.17 |
| Amharic | **3.90** | 11.95 | 11.98 | 6.45 |

## Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| **Total tokens** | **3,593** |
| vs Llama 3 (128K) | 36.3% fewer tokens |
| vs Tekken (131K) | 17.3% fewer tokens |
| vs Qwen 2.5 (152K) | 30.7% fewer tokens |

## When to Use This Variant

**QT_V.2 64K** is ideal when you need the smallest possible embedding table: for parameter-constrained small models, edge deployment, or any setting where every MB of VRAM matters.

Also available: [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round) · [QT_V.2 Code 114K](https://huggingface.co/QuartzOpen/QT_V.2_Code_114K) (multilingual coding)

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer definition shipped in this repository
tok = Tokenizer.from_file("tokenizer.json")

# Encode a sample sentence and inspect the resulting token strings
encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```

## Specifications

| Spec | Value |
|---|---|
| Vocabulary | 64,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 5.7 chars |

## Training

Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 58.5% Wikipedia (71 languages via wiki_ultra_clean v7.3), 18.0% code (14 languages), 23.5% Stack Exchange (49 sites via se_ultra_clean v1).

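
To make the embedding-table argument concrete, here is a rough back-of-the-envelope sketch of how vocabulary size translates into memory. The hidden dimension of 2048 and 2 bytes per parameter (fp16/bf16) are assumptions chosen for illustration, not properties of any model named in this card, and 128,000 is simply the nominal "128K" figure from the comparison tables.

```python
def embedding_table_mib(vocab_size, hidden_dim, bytes_per_param=2):
    """Size of a vocab_size x hidden_dim embedding matrix in MiB."""
    return vocab_size * hidden_dim * bytes_per_param / (1024 ** 2)


HIDDEN_DIM = 2048  # hypothetical small-model width, not taken from this card

# 64,000 is the vocabulary size from the specifications table;
# 128,000 is the nominal "128K" size used in the comparison tables.
qt_64k = embedding_table_mib(64_000, HIDDEN_DIM)
nominal_128k = embedding_table_mib(128_000, HIDDEN_DIM)

print(f"QT 64K:       {qt_64k:.0f} MiB")
print(f"128K nominal: {nominal_128k:.0f} MiB")
print(f"Saved:        {nominal_128k - qt_64k:.0f} MiB")
```

Under these assumptions the 64K table takes 250 MiB against 500 MiB for a 128K vocabulary. A model with an untied output projection would pay this cost twice.
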
## Files

`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

## Contact

Open-source: quartzopensource@gmail.com

Commercial licensing & enterprise: commercial@aeneaglobal.com

## License

Apache 2.0. Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```