---
language:
- en
- de
- ja
- fr
- es
- ru
- it
- zh
- he
- pt
- ko
- ar
- nl
- pl
- uk
- ta
- cs
- te
- th
- fa
- bn
- hu
- hi
- sv
- el
- fi
- id
- vi
- hy
- ro
- 'no'
- sr
- tr
- bg
- da
- gl
- ka
- mr
- pa
- sl
- et
- hr
- kn
- my
- sk
- ur
- af
- lt
- lv
- ne
- or
- si
- sq
- yi
- am
- bo
- br
- ca
- cy
- dv
- eu
- ga
- gd
- gu
- is
- km
- la
- mk
- ml
- sw
- tl
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- multilingual
- superbpe
- bpe
- byte-level
- quartz
- aenea
- ultralingo
---

# QT V.3 32K UltraLingo — SuperBPE Multilingual Tokenizer

**The most equitable small-vocabulary multilingual tokenizer available.**

A 32,000-token byte-level BPE tokenizer with SuperBPE two-stage training, covering **71 languages across 26 writing systems**. Designed for parameter-efficient small language models (sub-500M parameters) in the [AENEA](https://aenea.app) model family.

## Key Results (FLORES-200 Benchmark, 204 languages)

| Metric | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|--------|-----------|------------|--------------|
| **Vocab size** | 32,000 | 96,000 | 128,256 |
| **Mean fertility** | 4.354 | 3.942 | 5.716 |
| **Median fertility** | 2.792 | 2.574 | 2.700 |
| **Equity ratio** | 38.7× | 31.6× | 118.6× |
| **Embedding params (d=1024)** | 33M | 98M | 131M |

- **Beats Llama 3 (128K vocab) on 48/204 languages** with ¼ of the vocabulary
- **Beats QT V.2 96K on 24/204 languages** — particularly Indic and SE Asian scripts
- **Within 15% of QT V.2 96K on 145/204 languages** despite ⅓ of the vocabulary
- **3× better equity** than Llama 3 (38.7× vs 118.6×)
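
Fertility is tokens per word, so lower is better; the equity ratio compares the most and least efficiently tokenized languages. A minimal sketch of how these metrics can be reproduced, assuming the equity ratio is max/min mean fertility across languages and using toy sentences in place of the FLORES-200 corpus:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

def fertility(text: str) -> float:
    # Tokens per whitespace-delimited word (the tokens/word metric above).
    return len(tok.encode(text).ids) / max(len(text.split()), 1)

# Toy sentences for illustration only; the reported numbers use FLORES-200.
samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "de": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "ru": "Быстрая коричневая лиса прыгает через ленивую собаку.",
}
scores = {lang: fertility(text) for lang, text in samples.items()}
equity = max(scores.values()) / min(scores.values())
print(scores, f"equity ratio ≈ {equity:.1f}×")
```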

## Script Family Performance (tokens/word, lower is better)

| Script | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|--------|-----------|------------|--------------|
| Latin | 1.92 | 1.63 | 1.72 |
| Cyrillic | 2.83 | 2.24 | 2.43 |
| CJK | 21.54 | 17.25 | 19.64 |
| Arabic | 2.63 | 2.15 | 2.34 |
| **Indic** | **3.41** | 3.94 | 9.15 |
| **SE Asian** | **12.91** | 13.29 | 28.24 |

QT V.3 32K **outperforms tokenizers 3-4× its size** on Indic languages (Tamil, Telugu, Hindi, Bengali, Myanmar) and SE Asian scripts, while remaining competitive on Latin and Cyrillic.

## What is SuperBPE?

SuperBPE ([Liu et al., COLM 2025](https://arxiv.org/abs/2503.13423)) is a two-stage extension of BPE that allows tokens to span word boundaries:

- **Stage 1 (Subword):** Standard BPE with whitespace boundaries — learns roots, affixes, and morphemes (90% of the vocabulary)
- **Stage 2 (Superword):** Whitespace constraint lifted — learns multi-word expressions like "in order to" and "as well as" (10% of the vocabulary)

The ~3,200 superword tokens improved fertility by **25% on Tamil**, **19% on Malayalam**, **18% on Myanmar**, and **17% on Hindi and Thai** compared to Stage 1 alone.
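
Because superword tokens cross whitespace, you can spot them directly: encode a common multi-word expression and check whether it comes back as fewer tokens than it has words. A quick probe (which phrases actually merged depends on the training data, so treat the output as data-dependent):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

# If Stage 2 learned a phrase as a superword, it encodes to fewer
# tokens than its word count (sometimes a single token).
for phrase in ["in order to", "as well as", "one of the most"]:
    tokens = tok.encode(phrase).tokens
    print(f"{phrase!r}: {len(tokens)} token(s) -> {tokens}")
```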

## Design Innovations

1. **SuperBPE two-stage training** — first open multilingual SuperBPE tokenizer
2. **√-proportional language weighting** with a 0.3% floor per language — ensures every script family gets minimum representation (sketched after this list)
3. **71 languages, 26 scripts** in a 32K vocabulary — parameter-efficient for small models
4. **Single-digit splitting** — each digit tokenized individually for arithmetic reasoning ([Singh & Strouse, ICLR 2025](https://arxiv.org/abs/2402.14903))
5. **85 special tokens** including instruct markers, language tags, reasoning markers, and tool-use tokens — future-proofed for instruction tuning
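
A minimal sketch of the √-proportional weighting from item 2, assuming a single clamp-then-renormalise pass (the exact floor handling used in training is not specified in this card):

```python
import math

def language_weights(corpus_sizes: dict[str, float], floor: float = 0.003) -> dict[str, float]:
    # Square-root scaling compresses the gap between high- and
    # low-resource languages relative to proportional sampling.
    root = {lang: math.sqrt(n) for lang, n in corpus_sizes.items()}
    total = sum(root.values())
    weights = {lang: r / total for lang, r in root.items()}
    # Clamp every language to at least the 0.3% floor, then renormalise once.
    weights = {lang: max(w, floor) for lang, w in weights.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical corpus sizes, chosen so the tiny corpus triggers the floor.
print(language_weights({"en": 1_000_000, "ta": 50_000, "dv": 10}))
```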

## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|padding\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|endoftext\|>` | 2 | End of text / EOS |
| `<\|system\|>` | 5 | System prompt |
| `<\|user\|>` | 6 | User turn |
| `<\|assistant\|>` | 7 | Assistant turn |
| `<\|thinking\|>` | 10 | Reasoning start |
| `<\|lang:XX\|>` | 14-84 | Language tags (71 languages) |
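
The card defines these markers but not a full chat template, so the turn layout below is an assumption for illustration only; it shows how the instruct and language tokens could be assembled into a prompt:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

# Hypothetical turn layout; adjust to whatever template the model is trained with.
prompt = (
    "<|bos|><|system|>You are a helpful assistant."
    "<|user|><|lang:fr|>Bonjour, comment ça va ?"
    "<|assistant|>"
)
encoded = tok.encode(prompt)
print(encoded.ids[:8])  # the special tokens map to their reserved low IDs
```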

## Usage

```python
from tokenizers import Tokenizer

# Load from a local file...
tok = Tokenizer.from_file("tokenizer.json")
# ...or directly from the Hub
tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

encoded = tok.encode("The history of mathematics began in ancient civilizations.")
print(encoded.tokens)
print(encoded.ids)

# Multilingual
encoded_ja = tok.encode("日本の歴史は縄文時代から始まり")
encoded_ta = tok.encode("இந்தியா தெற்காசியாவில் அமைந்துள்ள ஒரு நாடு")
encoded_ar = tok.encode("تأسست الدولة العباسية في عام سبعمائة")
```

## Languages (71)

**Tier 1 — Primary:** English, German, Japanese, French, Spanish, Russian, Italian, Chinese, Hebrew, Portuguese, Korean

**Tier 2 — Important:** Arabic, Dutch, Polish, Ukrainian, Tamil, Czech, Telugu, Thai, Persian, Bengali, Hungarian, Hindi, Malayalam, Swedish, Greek, Finnish, Indonesian, Vietnamese

**Tier 3 — Coverage:** Armenian, Norwegian, Romanian, Serbian, Turkish, Bulgarian, Danish, Galician, Georgian, Marathi, Punjabi, Slovenian, Estonian, Croatian, Kannada, Myanmar, Slovak, Urdu, Afrikaans, Lithuanian, Latvian, Nepali, Odia, Sinhala, Albanian, Yiddish

**Tier 4 — Minimal:** Amharic, Tibetan, Breton, Catalan, Welsh, Dhivehi, Basque, Irish, Scots Gaelic, Gujarati, Icelandic, Khmer, Latin, Macedonian, Swahili, Tagalog

## Scripts (26)

Latin, Cyrillic, Han (Simplified/Traditional), Hiragana/Katakana, Hangul, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Thai, Malayalam, Kannada, Gujarati, Gurmukhi, Myanmar, Khmer, Tibetan, Sinhala, Odia, Georgian, Armenian, Ethiopic, Thaana, Greek

## Training Details

- **Algorithm:** SuperBPE (two-stage byte-level BPE)
- **Pre-tokenization:** LLaMA-style regex with single-digit splitting (Stage 1), sentence-boundary-only splitting (Stage 2)
- **SuperBPE transition:** 90% subword → 10% superword
- **Training data:** Balanced multilingual Wikipedia (71 languages) + Stack Exchange + code, processed by [wiki_ultra_clean v7.2](https://github.com/QuartzOpen/quartz-clean)
- **Language weighting:** √-proportional with a 0.3% minimum floor per language
- **Normalization:** None (lossless round-trip encoding)
- **Byte fallback:** Full 256-byte coverage via ByteLevel encoding
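
A sketch of a Stage 1 pre-tokenization stack along these lines, built from standard `tokenizers` components (the actual LLaMA-style regex used in training is not reproduced here):

```python
from tokenizers import pre_tokenizers

# Split every digit into its own piece, then apply byte-level encoding
# so any input falls back to the 256 base bytes.
stage1 = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
print(stage1.pre_tokenize_str("Apollo 11 landed in 1969."))
# '1969' comes out as four single-digit pieces.
```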

## Embedding Parameter Savings

| Model Scale | QT V.3 32K | QT V.2 96K | Llama 3 128K | V.3 Savings (vs V.2 96K) |
|-------------|-----------|------------|--------------|--------------------------|
| d=1024 (Prelude) | 33M | 98M | 131M | **65M fewer** |
| d=2048 (1B) | 66M | 197M | 263M | **131M fewer** |
| d=4096 (7B) | 131M | 393M | 525M | **262M fewer** |

The saved parameters can instead fund additional transformer layers, where they contribute to reasoning capability rather than sitting in an underutilised embedding table.
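
The table is simply vocab_size × d_model for a single embedding matrix (the figures match one matrix exactly; whether input and output embeddings are tied is model-specific). A two-line check:

```python
# Embedding parameter count = vocab_size * d_model (one matrix).
for name, vocab in [("QT V.3 32K", 32_000), ("QT V.2 96K", 96_000), ("Llama 3", 128_256)]:
    print(name, [f"{vocab * d / 1e6:.0f}M" for d in (1024, 2048, 4096)])
```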

## References

- Liu et al. (2025) "[SuperBPE: Space Travel for Language Models](https://arxiv.org/abs/2503.13423)" — COLM 2025
- Tao et al. (2024) "[Scaling Laws with Vocabulary](https://proceedings.neurips.cc/paper_files/paper/2024/hash/cf5a019ae9c11b4be88213ce3f85d85c-Abstract-Conference.html)" — NeurIPS 2024
- "The Art of Breaking Words" (2025) — arXiv 2508.06533 (iterative fertility balancing)
- "IndicSuperTokenizer" (2025) — arXiv 2511.03237 (two-stage subword+superword for Indic)
- "The Depth Delusion" (2026) — arXiv 2601.20994 (width > depth, 32K optimal for small models)
- Singh & Strouse (2025) "[Tokenization Counts](https://arxiv.org/abs/2402.14903)" — ICLR 2025 (single-digit splitting)

## Part of the Quartz Tokenizer Family

| Tokenizer | Vocab | Target | Status |
|-----------|-------|--------|--------|
| QT V.2 64K | 64,000 | General multilingual | Released |
| QT V.2 96K | 96,000 | Extended multilingual | Released |
| QT V.2 Code 114K | 114,000 | Code + multilingual | Released |
| **QT V.3 32K UltraLingo** | **32,000** | **Parameter-efficient SuperBPE** | **New** |

---

*Built by [Quartz Data Infrastructure](https://quartz.host) for the [AENEA](https://aenea.app) model family.*

*QT V.3 UltraLingo: Fewer tokens. More meaning. Every language.*