---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- quartz
- aenea
- flores
pipeline_tag: text-generation
---
# QT_V.2 96K — Best All-Round Multilingual Tokenizer
**Fewest total tokens on FLORES-200 of any tokenizer tested.** A 96,000-token vocabulary covering 71 languages and 26 script families. It has the lowest equity ratio of the tokenizers compared below (roughly 4× lower than Llama 3 and Tekken) while using a 25–37% smaller vocabulary than Llama 3, Tekken, and Qwen 2.5.
Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).
## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT 96K | QT Code 114K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | **12,961,617** ✓ | 13,007,924 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | **31.6×** ✓ | 43.3× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | **3.94** ✓ | 4.03 | 4.18 | 5.72 | 5.34 | 4.91 |
| Worst language | lao (43.0) | lao (58.0) | lao (58.0) | bod (149.8) | bod (168.4) | bod (98.0) |
**QT 96K wins on total tokens, equity, and mean fertility.** The 31.6× equity ratio means the worst-served language needs 31.6× as many tokens as the best-served, compared with 118.6× for Llama 3 and 127.9× for Tekken. Llama 3's worst language (Tibetan at 149.8 tok/word) is **3.6× more expensive** than the same language under QT 96K (41.1 tok/word).
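The fertility and equity figures can be approximated from per-language token and word counts. The sketch below assumes fertility is tokens per whitespace-delimited word and that the equity ratio is the highest per-language fertility divided by the lowest; this card does not spell out either definition, so treat both as assumptions. The names `fertility`, `equity_ratio`, and `toy_count`, and the two-language corpus, are illustrative and not benchmark data.

```python
from typing import Callable, Dict, List


def fertility(sentences: List[str], count_tokens: Callable[[str], int]) -> float:
    """Tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = sum(count_tokens(s) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words


def equity_ratio(per_language: Dict[str, float]) -> float:
    """Highest fertility divided by lowest fertility (assumed definition)."""
    return max(per_language.values()) / min(per_language.values())


def toy_count(text: str) -> int:
    """Toy stand-in for a tokenizer: one token per 4 characters, rounded up."""
    return -(-len(text) // 4)


# Hypothetical corpus; a real run would use the FLORES-200 sentences per language.
corpus = {
    "lang_a": ["the cat sat", "on the mat"],
    "lang_b": ["averyverylongunsegmentedsentence"],
}
scores = {lang: fertility(sents, toy_count) for lang, sents in corpus.items()}
print(scores, equity_ratio(scores))
```

To score a real tokenizer, pass something like `lambda s: len(tok.encode(s).ids)` as `count_tokens`, where `tok` is a loaded `tokenizers.Tokenizer`. Languages written without spaces between words yield few whitespace-delimited words per sentence, which is one reason their tok/word figures are so much higher.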
### Script Family Averages (FLORES-200 tok/word)
| Script Family | QT 96K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Latin (37 langs) | **2.20** | 2.39 | 2.20 | 2.41 |
| Cyrillic (5) | **2.23** | 2.59 | 2.27 | 2.99 |
| CJK (4) | 17.17 | 19.75 | 21.36 | **17.26** |
| Indic Other (9) | **4.21** | 12.42 | 6.77 | 10.37 |
| SE Asian (4) | **20.70** | 31.08 | 38.22 | 24.04 |
| Unique Scripts (6) | **9.35** | 32.96 | 32.05 | 21.39 |
QT 96K is about **3× more efficient** than Llama 3 on the Indic Other group (4.21 vs 12.42 tok/word) and about **3.5× more efficient** on unique scripts (9.35 vs 32.96 tok/word; Georgian, Armenian, Tibetan, Amharic, Hebrew, Greek).
## Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| **Total tokens** | **3,339** |
| vs Llama 3 (128K) | 40.8% fewer tokens |
| vs Tekken (131K) | 23.2% fewer tokens |
| vs Qwen 2.5 (152K) | 35.6% fewer tokens |
Wins 6 of 9 benchmark categories: V1 Expansion, V2 New Scripts, V2 Gap-closers, V2 Latin Wikis, Celtic/Brythonic, and Natural Languages (within 1% of Tekken).
## When to Use This Variant
**QT_V.2 96K** is the recommended general-purpose tokenizer. Of the three QT_V.2 variants, it offers the best balance between vocabulary size and token compression across language families, and it is intended for production multilingual models serving diverse user populations.
Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 Code 114K](https://huggingface.co/QuartzOpen/QT_V.2_Code_114K) (multilingual coding)
## Usage
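```python
from tokenizers import Tokenizer

# Load the tokenizer definition shipped in this repository.
tok = Tokenizer.from_file("tokenizer.json")
# Encode a sentence and print its token strings.
encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)
```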
## Specifications
| Spec | Value |
|---|---|
| Vocabulary | 96,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.1 chars |
| Compression | 3.17 chars/token |
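
Two rows in the table above are easier to read with a concrete example. The sketch below assumes "Compression" means total characters divided by total tokens, and that "single-digit splitting" means every digit becomes its own piece before BPE merges are applied. The regex is a simplified stand-in written for illustration, not the pretokenizer pattern shipped in `tokenizer.json`, and the function names are hypothetical.

```python
import re
from typing import List

# Simplified stand-in: each digit is its own piece, non-digit runs stay together.
DIGIT_SPLIT = re.compile(r"\d|\D+")


def split_digits(text: str) -> List[str]:
    """Split text so that no piece contains more than one digit."""
    return DIGIT_SPLIT.findall(text)


def chars_per_token(texts: List[str], token_counts: List[int]) -> float:
    """Total characters divided by total tokens (assumed meaning of 'Compression')."""
    return sum(len(t) for t in texts) / sum(token_counts)


print(split_digits("x = 2026"))
print(chars_per_token(["abcdef", "gh"], [2, 2]))
```

Splitting digits one at a time keeps number tokenization uniform regardless of magnitude, at the cost of more tokens for long numbers.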
## Training
Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 57.1% Wikipedia (71 languages via wiki_ultra_clean v7.3), 21.0% code (14 languages, boosted +25%), 21.9% Stack Exchange (49 sites). Top-10 European languages boosted +10%, Hindi/Bengali +15%.
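
One way to read the boost figures is as multiplicative factors applied to each source's sampling weight before renormalizing. The card does not say how the boosts were actually applied, so the sketch below is only an assumption; `apply_boosts` and the base weights are hypothetical and are not the real corpus sizes.

```python
from typing import Dict


def apply_boosts(base: Dict[str, float], boosts: Dict[str, float]) -> Dict[str, float]:
    """Scale each source's weight by (1 + boost), then renormalize to sum to 1."""
    scaled = {name: w * (1.0 + boosts.get(name, 0.0)) for name, w in base.items()}
    total = sum(scaled.values())
    return {name: w / total for name, w in scaled.items()}


# Hypothetical base weights in arbitrary units (not from the model card).
base = {"wikipedia": 50.0, "code": 25.0, "stack_exchange": 25.0}
mix = apply_boosts(base, {"code": 0.25})
print({name: round(share, 3) for name, share in mix.items()})
```

Under this reading, boosting one source raises its share and lowers every other source's share proportionally, so the final percentages always sum to 100%.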
## Files
`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`
## Contact
Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com
## License
Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd
```bibtex
@misc{qt_v2_2026,
title={QT_V.2: A Multilingual BPE Tokenizer Family},
author={AENEA Global Ltd},
year={2026},
url={https://quartz.host},
}
```