---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- quartz
- aenea
- flores
pipeline_tag: text-generation
---
# QT_V.2 64K — Multilingual BPE Tokenizer
**The most equitable 64K tokenizer available.** 71 natural languages across 26 script families, with half the vocabulary of Llama 3, Tekken, and Qwen 2.5 — yet fewer total tokens on both FLORES-200 (204 languages) and our 66-test field benchmark.
Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).
## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT 64K | QT 96K | QT Code 114K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | 13,592,357 | **12,961,617** | 13,007,924 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | **41.0×** | **31.6×** | 43.3× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.18 | 3.94 | 4.03 | 5.72 | 5.34 | 4.91 |
The equity ratio measures the gap between best-served and worst-served language (lower is fairer). QT 64K at 41.0× is **2.9× more equitable than Llama 3** (118.6×) and **3.1× more equitable than Tekken** (127.9×) — at half the vocabulary.
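The equity ratio can be computed directly from per-language fertility. A minimal sketch, using illustrative values rather than the actual FLORES-200 measurements:

```python
# Sketch: equity ratio = worst-served fertility / best-served fertility.
# The numbers below are hypothetical, chosen only to illustrate a 41.0x ratio.
fertility = {
    "en": 1.3,   # best-served language (hypothetical)
    "de": 1.6,
    "bo": 53.3,  # worst-served language (hypothetical)
}

def equity_ratio(fert: dict) -> float:
    """Gap between worst- and best-served language; lower is fairer."""
    vals = fert.values()
    return max(vals) / min(vals)

print(f"{equity_ratio(fertility):.1f}x")  # 53.3 / 1.3 ≈ 41.0x
```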
### Where QT 64K Dominates (FLORES-200 tok/word)
| Language | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Tibetan | **42.5** | 149.8 | 168.4 | 98.0 |
| Odia | **4.16** | 16.90 | 18.30 | 13.65 |
| Khmer | **17.1** | 40.9 | 70.5 | 30.7 |
| Georgian | **3.83** | 15.47 | 3.93 | 8.33 |
| Sinhala | **3.84** | 11.37 | 16.60 | 9.17 |
| Amharic | **3.90** | 11.95 | 11.98 | 6.45 |
## Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| **Total tokens** | **3,593** |
| vs Llama 3 (128K) | 36.3% fewer tokens |
| vs Tekken (131K) | 17.3% fewer tokens |
| vs Qwen 2.5 (152K) | 30.7% fewer tokens |
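The "% fewer tokens" figures follow from total token counts. A minimal sketch of the calculation, checked here against the FLORES-200 totals from the table above (the field-benchmark competitor totals are not listed on this card):

```python
def pct_fewer(ours: int, theirs: int) -> float:
    """Percent fewer tokens than a competitor; higher is better."""
    return 100.0 * (theirs - ours) / theirs

# FLORES-200 totals from the results table above
qt_64k = 13_592_357
llama3 = 16_764_198
print(f"{pct_fewer(qt_64k, llama3):.1f}% fewer tokens")  # 18.9% fewer on FLORES-200
```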
## When to Use This Variant
**QT_V.2 64K** is ideal when you need the smallest possible embedding table — for parameter-constrained small models, edge deployment, or when every MB of VRAM matters.
Also available: [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round) · [QT_V.2 Code 114K](https://huggingface.co/QuartzOpen/QT_V.2_Code_114K) (multilingual coding)
## Usage
```python
from tokenizers import Tokenizer

# Load the tokenizer shipped in this repository
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)  # subword strings
print(encoded.ids)     # vocabulary ids

# Round-trip back to text
print(tok.decode(encoded.ids))
```
## Specifications
| Spec | Value |
|---|---|
| Vocabulary | 64,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 5.7 chars |
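Single-digit splitting means every digit becomes its own pretoken, so numbers of any length are represented digit by digit. A sketch of that effect with a simplified regex (illustrative only, not the tokenizer's actual Llama 3 pattern):

```python
import re

# Simplified pretokenizer: each digit alone, runs of non-digit
# non-space characters together, runs of whitespace together.
pattern = re.compile(r"\d|[^\d\s]+|\s+")

def pretokenize(text: str) -> list:
    return pattern.findall(text)

print(pretokenize("price: 1234"))  # → ['price:', ' ', '1', '2', '3', '4']
```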
## Training
Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 58.5% Wikipedia (71 languages via wiki_ultra_clean v7.3), 18.0% code (14 languages), 23.5% Stack Exchange (49 sites via se_ultra_clean v1).
## Files
`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`
## Contact
Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com
## License
Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd
```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```