---
language:
- en
- de
- ja
- fr
- es
- ru
- it
- zh
- he
- pt
- ko
- ar
- nl
- pl
- uk
- ta
- cs
- te
- th
- fa
- bn
- hu
- hi
- sv
- el
- fi
- id
- vi
- hy
- ro
- 'no'
- sr
- tr
- bg
- da
- gl
- ka
- mr
- pa
- sl
- et
- hr
- kn
- my
- sk
- ur
- af
- lt
- lv
- ne
- or
- si
- sq
- yi
- am
- bo
- br
- ca
- cy
- dv
- eu
- ga
- gd
- gu
- is
- km
- la
- mk
- ml
- sw
- tl
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- multilingual
- superbpe
- bpe
- byte-level
- quartz
- aenea
- ultralingo
---

# QT V.3 32K UltraLingo — SuperBPE Multilingual Tokenizer

**The most equitable small-vocabulary multilingual tokenizer available.**

A 32,000-token byte-level BPE tokenizer with SuperBPE two-stage training, covering **71 languages across 26 writing systems**. Designed for parameter-efficient small language models (sub-500M parameters) in the [AENEA](https://aenea.app) model family.

## Key Results (FLORES-200 Benchmark, 204 languages)

| Metric | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|--------|-----------|------------|--------------|
| **Vocab size** | 32,000 | 96,000 | 128,256 |
| **Mean fertility** | 4.354 | 3.942 | 5.716 |
| **Median fertility** | 2.792 | 2.574 | 2.700 |
| **Equity ratio** | 38.7× | 31.6× | 118.6× |
| **Embedding params (d=1024)** | 33M | 98M | 131M |

- **Beats Llama 3 (128K vocab) on 48/204 languages** with ¼ of the vocabulary
- **Beats QT V.2 96K on 24/204 languages** — particularly Indic and SE Asian scripts
- **Within 15% of QT V.2 96K on 145/204 languages** despite ⅓ of the vocabulary
- **3× better equity** than Llama 3 (38.7× vs 118.6×)

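The card does not define these metrics explicitly. Assuming fertility means tokens per word on the benchmark text, and the equity ratio compares the worst-served language to the best-served one (both assumptions inferred from the tables below), the statistics reduce to:

```python
from statistics import mean, median

def fertility_stats(per_language):
    """per_language maps a language code to (num_tokens, num_words)
    for that language's slice of the benchmark corpus."""
    fert = {lang: toks / words for lang, (toks, words) in per_language.items()}
    return {
        "mean_fertility": mean(fert.values()),
        "median_fertility": median(fert.values()),
        # equity ratio: worst-served language vs best-served language
        "equity_ratio": max(fert.values()) / min(fert.values()),
    }

# Toy numbers, not the real FLORES-200 counts.
stats = fertility_stats({
    "en": (1300, 1000),  # 1.3 tokens/word
    "ta": (5200, 1000),  # 5.2 tokens/word
    "ja": (2600, 1000),  # 2.6 tokens/word
})
print(stats)
```

A lower equity ratio means the tokenizer's cost per word is spread more evenly across languages.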
## Script Family Performance (tokens/word, lower is better)

| Script | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|--------|-----------|------------|--------------|
| Latin | 1.92 | 1.63 | 1.72 |
| Cyrillic | 2.83 | 2.24 | 2.43 |
| CJK | 21.54 | 17.25 | 19.64 |
| Arabic | 2.63 | 2.15 | 2.34 |
| **Indic** | **3.41** | 3.94 | 9.15 |
| **SE Asian** | **12.91** | 13.29 | 28.24 |

QT V.3 32K **outperforms tokenizers 3-4× its size** on Indic languages (Tamil, Telugu, Hindi, Bengali, Myanmar) and SE Asian scripts, while remaining competitive on Latin and Cyrillic.

## What is SuperBPE?

SuperBPE ([Liu et al., COLM 2025](https://arxiv.org/abs/2503.13423)) is a two-stage extension of BPE that allows tokens to span word boundaries:

- **Stage 1 (subword):** standard BPE with whitespace boundaries — learns roots, affixes, and morphemes (90% of the vocabulary)
- **Stage 2 (superword):** the whitespace constraint is lifted — learns multi-word expressions such as "in order to" and "as well as" (10% of the vocabulary)

The ~3,200 superword tokens improved fertility by **25% on Tamil**, **19% on Malayalam**, **18% on Myanmar**, and **17% on Hindi and Thai** compared to Stage 1 alone.

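The two stages can be illustrated with a toy, frequency-only BPE in pure Python. This is a conceptual sketch, not the production trainer (which operates on bytes, not characters, and uses a proper stopping criterion):

```python
from collections import Counter

def merge(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair` in `seq`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def best_pair(seqs):
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))
    return counts.most_common(1)[0][0] if counts else None

def train_superbpe(text, n_subword, n_superword):
    merges = []
    # Stage 1: merges are confined to whitespace-delimited words.
    words = [list(w) for w in text.split()]
    for _ in range(n_subword):
        pair = best_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge(w, pair, pair[0] + pair[1]) for w in words]
    # Stage 2: lift the constraint -- flatten into one sequence with
    # explicit space symbols, so merges may cross word boundaries.
    seq = []
    for w in words:
        seq += [" "] + w
    seq = seq[1:]
    for _ in range(n_superword):
        pair = best_pair([seq])
        if pair is None:
            break
        merges.append(pair)
        seq = merge(seq, pair, pair[0] + pair[1])
    return merges, seq

merges, seq = train_superbpe("in order to win in order to lead", 3, 5)
print(merges)
```

Stage-1 merges never contain a space; once the constraint is lifted, the most frequent pairs immediately start bridging words, which is exactly where the superword vocabulary comes from.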
## Design Innovations

1. **SuperBPE two-stage training** — first open multilingual SuperBPE tokenizer
2. **√-proportional language weighting** with a 0.3% floor per language — ensures every script family gets minimum representation
3. **71 languages, 26 scripts** in a 32K vocabulary — parameter-efficient for small models
4. **Single-digit splitting** — each digit is tokenized individually to aid arithmetic reasoning ([Singh & Strouse, ICLR 2025](https://arxiv.org/abs/2305.14201))
5. **85 special tokens** including instruct markers, language tags, reasoning markers, and tool-use tokens — future-proofed for instruction tuning

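Item 2 can be sketched as follows. The card does not spell out how weights are renormalized after applying the floor, so this shows one plausible reading: a single lift-and-renormalize pass (which can leave floored languages marginally below 0.3%; a production version might iterate to convergence):

```python
import math

def language_weights(corpus_sizes, floor=0.003):
    """Sampling weights proportional to sqrt(corpus size), with every
    language lifted to at least `floor` before renormalizing."""
    raw = {k: math.sqrt(v) for k, v in corpus_sizes.items()}
    total = sum(raw.values())
    w = {k: v / total for k, v in raw.items()}
    # Lift tiny languages to the floor, then renormalize once.
    lifted = {k: max(v, floor) for k, v in w.items()}
    total = sum(lifted.values())
    return {k: v / total for k, v in lifted.items()}

# Hypothetical corpus sizes in bytes, not the real training mix.
weights = language_weights({"en": 1_000_000, "de": 250_000, "dv": 1})
print(weights)
```

The square root compresses the gap between high- and low-resource languages, and the floor guarantees that even the smallest corpus (here `dv`) contributes roughly 0.3% of the training stream.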
## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|padding\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|endoftext\|>` | 2 | End of text / EOS |
| `<\|system\|>` | 5 | System prompt |
| `<\|user\|>` | 6 | User turn |
| `<\|assistant\|>` | 7 | Assistant turn |
| `<\|thinking\|>` | 10 | Reasoning start |
| `<\|lang:XX\|>` | 14-84 | Language tags (71 languages) |

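No official chat template ships with this card. As a purely hypothetical illustration, the turn markers above could frame a conversation like this (the ordering and absence of separators are assumptions, not a documented format):

```python
def format_chat(system, turns):
    """Frame a conversation with the card's turn markers.
    The exact layout is an assumption -- this tokenizer publishes
    token IDs but no official chat template."""
    parts = ["<|bos|>", f"<|system|>{system}"]
    for role, text in turns:  # role is "user" or "assistant"
        parts.append(f"<|{role}|>{text}")
    parts.append("<|assistant|>")  # trailing marker cues the model to respond
    return "".join(parts)

prompt = format_chat("You are helpful.", [("user", "Hi!")])
print(prompt)
# <|bos|><|system|>You are helpful.<|user|>Hi!<|assistant|>
```

Whatever layout is chosen at fine-tuning time, the markers should be added as special tokens so they encode to their reserved IDs rather than being split by BPE.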
## Usage

```python
from tokenizers import Tokenizer

# Load from a local file
tok = Tokenizer.from_file("tokenizer.json")
# or pull directly from the Hub
tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

encoded = tok.encode("The history of mathematics began in ancient civilizations.")
print(encoded.tokens)
print(encoded.ids)

# Multilingual
encoded_ja = tok.encode("日本の歴史は縄文時代から始まり")  # Japanese: "Japan's history begins in the Jōmon period"
encoded_ta = tok.encode("இந்தியா தெற்காசியாவில் அமைந்துள்ள ஒரு நாடு")  # Tamil: "India is a country in South Asia"
encoded_ar = tok.encode("تأسست الدولة العباسية في عام سبعمائة")  # Arabic: "The Abbasid state was founded in the year seven hundred"
```

## Languages (71)

**Tier 1 — Primary:** English, German, Japanese, French, Spanish, Russian, Italian, Chinese, Hebrew, Portuguese, Korean

**Tier 2 — Important:** Arabic, Dutch, Polish, Ukrainian, Tamil, Czech, Telugu, Thai, Persian, Bengali, Hungarian, Hindi, Malayalam, Swedish, Greek, Finnish, Indonesian, Vietnamese

**Tier 3 — Coverage:** Basque, Norwegian, Romanian, Serbian, Turkish, Bulgarian, Danish, Galician, Georgian, Marathi, Punjabi, Slovenian, Estonian, Croatian, Kannada, Myanmar, Slovak, Urdu, Afrikaans, Lithuanian, Latvian, Nepali, Odia, Sinhala, Albanian, Yiddish

**Tier 4 — Minimal:** Amharic, Tibetan, Breton, Catalan, Welsh, Dhivehi, Irish, Scots Gaelic, Gujarati, Icelandic, Khmer, Latin, Macedonian, Swahili, Tagalog

## Scripts (26)

Latin, Cyrillic, Han (Simplified/Traditional), Hiragana/Katakana, Hangul, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Thai, Malayalam, Kannada, Gujarati, Gurmukhi, Myanmar, Khmer, Tibetan, Sinhala, Odia, Georgian, Armenian, Ethiopic, Thaana, Greek

## Training Details

- **Algorithm:** SuperBPE (two-stage byte-level BPE)
- **Pre-tokenization:** LLaMA-style regex with single-digit splitting (Stage 1), sentence-boundary-only splitting (Stage 2)
- **SuperBPE transition:** 90% subword → 10% superword
- **Training data:** Balanced multilingual Wikipedia (71 languages) + Stack Exchange + Code, processed by [wiki_ultra_clean v7.2](https://github.com/QuartzOpen/quartz-clean)
- **Language weighting:** √-proportional with a 0.3% minimum floor per language
- **Normalization:** None (lossless round-trip encoding)
- **Byte fallback:** Full 256-byte coverage via ByteLevel encoding

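The single-digit rule in the pre-tokenizer can be illustrated with a simplified regex. This is a toy pattern for demonstration; the production LLaMA-style pattern is more elaborate and is not reproduced in this card:

```python
import re

# Simplified pre-tokenization: an (optionally space-prefixed) run of
# letters, a single digit, a run of punctuation, or a whitespace char.
# Only the lone-\d alternative matters here: it forces "1999" apart.
PRETOKEN = re.compile(r" ?[^\W\d_]+|\d| ?[^\w\s]+|\s", re.UNICODE)

pieces = PRETOKEN.findall("Founded in 1999.")
print(pieces)
# ['Founded', ' in', ' ', '1', '9', '9', '9', '.']
```

Because every digit becomes its own pre-token, BPE can never merge multi-digit chunks like "99" into single vocabulary entries, which keeps number representations uniform for arithmetic.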
## Embedding Parameter Savings

| Model Scale | QT V.3 32K | QT V.2 96K | Llama 3 128K | V.3 Savings |
|-------------|-----------|------------|--------------|-------------|
| d=1024 (Prelude) | 33M | 98M | 131M | **65M fewer** |
| d=2048 (1B) | 66M | 197M | 263M | **131M fewer** |
| d=4096 (7B) | 131M | 393M | 525M | **262M fewer** |

The saved parameters can instead fund additional transformer layers, where they contribute to reasoning capability rather than sitting in an underutilized embedding table.

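The table's figures follow directly from vocab_size × d_model for a single tied embedding table, rounded to the nearest million (the savings column subtracts the rounded values against QT V.2 96K):

```python
def embed_params(vocab_size, d_model):
    # One row of d_model weights per vocabulary entry (tied embedding;
    # untied input/output tables would double this).
    return vocab_size * d_model

for d in (1024, 2048, 4096):
    row = [embed_params(v, d) for v in (32_000, 96_000, 128_256)]
    print(f"d={d}:", " | ".join(f"{p / 1e6:.0f}M" for p in row))
# d=1024: 33M | 98M | 131M
```

At d=1024 the embedding table alone is the difference between a ~33M and a ~131M parameter budget line, which is why vocabulary size dominates small-model design.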
## References

- Liu et al. (2025) "[SuperBPE: Space Travel for Language Models](https://arxiv.org/abs/2503.13423)" — COLM 2025
- Tao et al. (2024) "[Scaling Laws with Vocabulary](https://proceedings.neurips.cc/paper_files/paper/2024/hash/cf5a019ae9c11b4be88213ce3f85d85c-Abstract-Conference.html)" — NeurIPS 2024
- "The Art of Breaking Words" (2025) — arXiv 2508.06533 (iterative fertility balancing)
- "IndicSuperTokenizer" (2025) — arXiv 2511.03237 (two-stage subword+superword for Indic)
- "The Depth Delusion" (2026) — arXiv 2601.20994 (width > depth, 32K optimal for small models)
- Singh & Strouse (2025) "[Tokenization Counts](https://arxiv.org/abs/2305.14201)" — ICLR 2025 (single-digit splitting)

## Part of the Quartz Tokenizer Family

| Tokenizer | Vocab | Target | Status |
|-----------|-------|--------|--------|
| QT V.2 64K | 64,000 | General multilingual | Released |
| QT V.2 96K | 96,000 | Extended multilingual | Released |
| QT V.2 Code 114K | 114,000 | Code + multilingual | Released |
| **QT V.3 32K UltraLingo** | **32,000** | **Parameter-efficient SuperBPE** | **New** |

---

*Built by [Quartz Data Infrastructure](https://quartz.host) for the [AENEA](https://aenea.app) model family.*

*QT V.3 UltraLingo: Fewer tokens. More meaning. Every language.*