---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
- code
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- code
- quartz
- aenea
- coding
- python
- flores
pipeline_tag: text-generation
---
# QT_V.2 Code 114K — Multilingual Coding Tokenizer
**Lowest total tokens of any tokenizer at any vocab size on our 66-test field benchmark.** A 114,688-entry vocabulary optimised for multilingual coding models, trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. Beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using 10–37% less vocabulary. Validated on FLORES-200 across 204 languages.
Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).
## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | 13,007,924 | **12,961,617** | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | 43.3× | **31.6×** | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.03 | **3.94** | 4.18 | 5.72 | 5.34 | 4.91 |
QT Code 114K uses **22.4% fewer tokens than Llama 3** and **9.8% fewer than Tekken** across all 204 FLORES languages — with 10–37% less vocabulary.
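The headline reductions follow directly from the total-token row in the table above; a quick check using those figures:

```python
# FLORES-200 total tokens, taken from the results table above.
qt_code = 13_007_924
llama3 = 16_764_198
tekken = 14_421_539

def pct_fewer(ours: int, theirs: int) -> float:
    """Percentage reduction in total tokens relative to a baseline."""
    return (theirs - ours) / theirs * 100

print(f"vs Llama 3: {pct_fewer(qt_code, llama3):.1f}% fewer")  # 22.4% fewer
print(f"vs Tekken:  {pct_fewer(qt_code, tekken):.1f}% fewer")  # 9.8% fewer
```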
### Key FLORES Languages (tok/word)
| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Japanese | **32.1** | 38.9 | 41.3 | 35.8 |
| Tibetan | **46.5** | 149.8 | 168.4 | 98.0 |
| Sinhala | **3.58** | 11.37 | 16.60 | 9.17 |
| Amharic | **3.40** | 11.95 | 11.98 | 6.45 |
| Georgian | **3.46** | 15.47 | 3.93 | 8.33 |
| Odia | **4.10** | 16.90 | 18.30 | 13.65 |
## Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| **Total tokens** | **3,314** (lowest of any tokenizer) |
| vs Llama 3 (128K) | 41.2% fewer tokens |
| vs Tekken (131K) | 23.8% fewer tokens |
| vs Qwen 2.5 (152K) | 36.1% fewer tokens |
### Code Performance
| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|---|---|
| Python | **110** | 115 | 125 | 97 | 112 | 105 |
| JavaScript | **67** | 71 | 71 | 65 | 69 | 64 |
| Rust | **111** | 113 | 117 | 108 | 111 | 107 |
Python compression improved from 125 (64K) to 115 (96K) to **110** (Code 114K) — closing the gap versus Llama 3's 97 from 28.9% to 13.4%.
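The gap-closing figures above are relative gaps against Llama 3's Python count of 97:

```python
# Python token counts from the code-performance table above.
llama3 = 97
qt_64k, qt_code = 125, 110

def gap(qt: int) -> float:
    """Relative gap versus Llama 3's Python token count, in percent."""
    return (qt - llama3) / llama3 * 100

print(f"64K gap:       {gap(qt_64k):.1f}%")   # 28.9%
print(f"Code 114K gap: {gap(qt_code):.1f}%")  # 13.4%
```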
### Category Totals (lower is better)
| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Natural Languages (20) | **1,033** | 1,599 | 1,038 | 1,535 |
| V1 Expansion (14) | **662** | 1,758 | 1,092 | 1,509 |
| V2 New Scripts (3) | **188** | 692 | 740 | 523 |
| Celtic / Brythonic (8) | **312** | 391 | 341 | 384 |
| Code (3) | 288 | **270** | 292 | 276 |
| **TOTAL (66 tests)** | **3,314** | 5,639 | 4,347 | 5,183 |
## When to Use This Variant
**QT_V.2 Code 114K** is designed for multilingual coding assistants and code generation models. It wins Natural Languages outright (1,033 — beating Tekken's 1,038) while offering competitive code compression. Ideal for models that must serve both code and diverse natural language users.
Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round)
## Usage
```python
from tokenizers import Tokenizer

# Load the tokenizer file shipped with this repo.
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode(
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n-1) + fibonacci(n-2)"
)
print(encoded.tokens)
```
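Fertility (tokens per word), the metric reported in the FLORES tables above, can be measured as total tokens over total words. This is a minimal sketch: `toy_encode` is a stand-in for `tok.encode(...).tokens`, and the whitespace word split is our assumption about the benchmark's exact methodology.

```python
def fertility(sentences, encode):
    """Tokens per word over a corpus: total tokens / total whitespace words."""
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

def toy_encode(s):
    """Stand-in tokenizer (one token per 4 characters), illustrative only."""
    return [s[i:i + 4] for i in range(0, len(s), 4)]

print(fertility(["a parallel benchmark sentence"], toy_encode))  # 2.0
```

Swap `toy_encode` for a closure over the real tokenizer to reproduce per-language fertility numbers.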
## Specifications
| Spec | Value |
|---|---|
| Vocabulary | 114,688 |
| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.24 chars |
| Compression | 3.60 chars/token |
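The single-digit arithmetic splitting noted above means every numeral becomes its own pretoken before BPE merges run, so no multi-digit number is ever a single token. A minimal illustration with a simplified pattern (not the actual Llama 3 regex):

```python
import re

# Simplified pretokenizer sketch: each digit stands alone, other runs of
# non-space characters stay intact. Illustrates single-digit splitting only.
pattern = re.compile(r"\d|[^\d\s]+|\s")

print(pattern.findall("x = 2026"))  # ['x', ' ', '=', ' ', '2', '0', '2', '6']
```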
## Training
Byte-level BPE with Llama 3 regex pretokenizer. Code-heavy corpus:
| Category | Share | Sources |
|---|---|---|
| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |
## Files
`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`
## Contact
Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com
## License
Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd
## Citation
```bibtex
@misc{qt_v2_2026,
title={QT_V.2: A Multilingual BPE Tokenizer Family},
author={AENEA Global Ltd},
year={2026},
url={https://quartz.host},
}
```