---
language:
  - en
  - de
  - fr
  - es
  - pt
  - it
  - nl
  - pl
  - ro
  - cs
  - sv
  - da
  - "no"
  - fi
  - hu
  - hr
  - bg
  - tr
  - ca
  - ru
  - uk
  - sr
  - zh
  - ja
  - ko
  - ar
  - fa
  - he
  - hi
  - bn
  - th
  - vi
  - ka
  - hy
  - el
  - yi
  - ur
  - ta
  - te
  - gu
  - pa
  - ml
  - kn
  - am
  - si
  - my
  - km
  - mr
  - ne
  - or
  - bo
  - dv
  - eu
  - gl
  - gd
  - et
  - sk
  - lt
  - sl
  - lv
  - af
  - sq
  - sw
  - is
  - tl
  - cy
  - ga
  - br
  - la
  - mk
  - id
  - code
license: apache-2.0
library_name: tokenizers
tags:
  - tokenizer
  - bpe
  - multilingual
  - code
  - quartz
  - aenea
  - coding
  - python
  - flores
pipeline_tag: text-generation
---

# QT_V.2 Code 114K — Multilingual Coding Tokenizer

**The lowest total token count of any tokenizer at any vocab size on our 66-test field benchmark.** A 114,688-entry vocabulary optimised for multilingual coding models, trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. It beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using 10–37% less vocabulary, and is validated on FLORES-200 across 204 languages.

Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).

## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | 13,007,924 | **12,961,617** | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | 43.3× | **31.6×** | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.03 | **3.94** | 4.18 | 5.72 | 5.34 | 4.91 |

QT Code 114K uses **22.4% fewer tokens than Llama 3** and **9.8% fewer than Tekken** across all 204 FLORES languages — with 10–37% less vocabulary.
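
Fertility here is tokens per word, and the equity ratio is presumably the spread between the highest- and lowest-fertility languages (that definition is an assumption, not stated in the tables). A minimal sketch of both metrics with made-up numbers:

```python
def fertility(token_count, word_count):
    """Mean tokens per word for one language's corpus slice."""
    return token_count / word_count

# Hypothetical (token_count, word_count) pairs per language,
# for illustration only -- not benchmark data.
per_language = {"en": (1000, 800), "bo": (4500, 900)}
ferts = {lang: fertility(t, w) for lang, (t, w) in per_language.items()}

# Assumed definition: worst-case fertility over best-case fertility.
equity = max(ferts.values()) / min(ferts.values())
print(round(equity, 1))  # 4.0
```

A lower equity ratio means the tokenizer treats its worst-served language less unfairly relative to its best-served one.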

### Key FLORES Languages (tok/word)

| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Japanese | **32.1** | 38.9 | 41.3 | 35.8 |
| Tibetan | **46.5** | 149.8 | 168.4 | 98.0 |
| Sinhala | **3.58** | 11.37 | 16.60 | 9.17 |
| Amharic | **3.40** | 11.95 | 11.98 | 6.45 |
| Georgian | **3.46** | 15.47 | 3.93 | 8.33 |
| Odia | **4.10** | 16.90 | 18.30 | 13.65 |

## Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| **Total tokens** | **3,314** (lowest of any tokenizer) |
| vs Llama 3 (128K) | 41.2% fewer tokens |
| vs Tekken (131K) | 23.8% fewer tokens |
| vs Qwen 2.5 (152K) | 36.1% fewer tokens |

### Code Performance

| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|---|---|
| Python | **110** | 115 | 125 | 97 | 112 | 105 |
| JavaScript | **67** | 71 | 71 | 65 | 69 | 64 |
| Rust | **111** | 113 | 117 | 108 | 111 | 107 |

Python compression improved from 125 (64K) to 115 (96K) to **110** (Code 114K) — closing the gap versus Llama 3's 97 from 28.9% to 13.4%.

### Category Totals (lower is better)

| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Natural Languages (20) | **1,033** | 1,599 | 1,038 | 1,535 |
| V1 Expansion (14) | **662** | 1,758 | 1,092 | 1,509 |
| V2 New Scripts (3) | **188** | 692 | 740 | 523 |
| Celtic / Brythonic (8) | **312** | 391 | 341 | 384 |
| Code (3) | 288 | **270** | 292 | 276 |
| **TOTAL (66 tests)** | **3,314** | 5,639 | 4,347 | 5,183 |

## When to Use This Variant

**QT_V.2 Code 114K** is designed for multilingual coding assistants and code generation models. It wins Natural Languages outright (1,033 — beating Tekken's 1,038) while offering competitive code compression. Ideal for models that must serve both code and diverse natural language users.

Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round)

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer file shipped in this repo
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode(
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n-1) + fibonacci(n-2)"
)
print(encoded.tokens)  # list of token strings
```

## Specifications

| Spec | Value |
|---|---|
| Vocabulary | 114,688 |
| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.24 chars |
| Compression | 3.60 chars/token |
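
The "single-digit splitting" rule means runs of digits are pretokenized into one token per digit, which keeps arithmetic behaviour uniform across numbers. A toy illustration of the idea (this regex is an illustration only, not the actual Llama 3 pretokenizer pattern):

```python
import re

# Split text so each digit becomes its own piece, while runs of
# non-digits stay together -- mimicking single-digit splitting.
pieces = re.findall(r"\d|\D+", "price: 2048")
print(pieces)  # ['price: ', '2', '0', '4', '8']
```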

## Training

Byte-level BPE with Llama 3 regex pretokenizer. Code-heavy corpus:

| Category | Share | Sources |
|---|---|---|
| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |
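
At its core, BPE training repeatedly merges the most frequent adjacent symbol pair into a new vocabulary entry. A toy sketch of that merge loop (illustrative only, not the production training code, which operates on bytes and uses the regex pretokenizer above):

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent symbol pairs across all sequences; return the top one."""
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seqs, pair):
    """Replace every occurrence of `pair` with the fused symbol."""
    fused = pair[0] + pair[1]
    out = []
    for seq in seqs:
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(fused)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        out.append(merged)
    return out

# Start from character-level symbols (real BPE here starts from bytes)
corpus = [list("low"), list("lower"), list("lowest")]
for _ in range(2):  # two merge steps: 'l'+'o' -> 'lo', then 'lo'+'w' -> 'low'
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus[0])  # ['low']
```

Each merge adds one entry to the vocabulary; training stops when the target vocabulary size (here, 114,688) is reached.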

## Files

`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

## Contact

Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com

## License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```