# QT V.3 32K UltraLingo — SuperBPE Multilingual Tokenizer
The most equitable small-vocabulary multilingual tokenizer available.
A 32,000-token byte-level BPE tokenizer with SuperBPE two-stage training, covering 71 languages across 26 writing systems. Designed for parameter-efficient small language models (sub-500M parameters) in the AENEA model family.
## Key Results (FLORES-200 Benchmark, 204 languages)
| Metric | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|---|---|---|---|
| Vocab size | 32,000 | 96,000 | 128,256 |
| Mean fertility | 4.354 | 3.942 | 5.716 |
| Median fertility | 2.792 | 2.574 | 2.700 |
| Equity ratio | 38.7× | 31.6× | 118.6× |
| Embedding params (d=1024) | 33M | 98M | 131M |
- Beats Llama 3 (128K vocab) on 48/204 languages with ¼ of the vocabulary
- Beats QT V.2 96K on 24/204 languages — particularly Indic and SE Asian scripts
- Within 15% of QT V.2 96K on 145/204 languages despite ⅓ of the vocabulary
- 3× better equity than Llama 3 (38.7× vs 118.6×)
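Fertility is tokens per word on benchmark text; the equity ratio is read here as worst-case fertility divided by best-case fertility across languages (a plausible interpretation, since the card does not define it). A minimal sketch of how these summary metrics fall out of per-language measurements, using made-up values rather than the FLORES-200 numbers:

```python
import statistics

# Hypothetical per-language fertility values (tokens per word);
# the real table is computed over all 204 FLORES-200 languages.
fertility = {"eng": 1.4, "deu": 1.8, "tam": 3.4, "mya": 12.9, "zho": 21.5}

mean_fertility = statistics.mean(fertility.values())
median_fertility = statistics.median(fertility.values())
# Equity ratio: how much worse off the worst-served language is
# than the best-served one.
equity_ratio = max(fertility.values()) / min(fertility.values())

print(f"mean={mean_fertility:.3f} median={median_fertility:.3f} "
      f"equity={equity_ratio:.1f}x")
```

Note how a few high-fertility scripts (CJK, SE Asian) pull the mean well above the median, which is why the table reports both.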
## Script Family Performance (tokens/word, lower is better)
| Script | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|---|---|---|---|
| Latin | 1.92 | 1.63 | 1.72 |
| Cyrillic | 2.83 | 2.24 | 2.43 |
| CJK | 21.54 | 17.25 | 19.64 |
| Arabic | 2.63 | 2.15 | 2.34 |
| Indic | 3.41 | 3.94 | 9.15 |
| SE Asian | 12.91 | 13.29 | 28.24 |
QT V.3 32K outperforms tokenizers 3-4× its size on Indic languages (Tamil, Telugu, Hindi, Bengali, Myanmar) and SE Asian scripts, while remaining competitive on Latin and Cyrillic.
## What is SuperBPE?
SuperBPE (Liu et al., COLM 2025) is a two-stage extension of BPE that allows tokens to span across word boundaries:
- Stage 1 (Subword): Standard BPE with whitespace boundaries — learns roots, affixes, morphemes (90% of vocabulary)
- Stage 2 (Superword): Whitespace constraint lifted — learns multi-word expressions like "in order to", "as well as" (10% of vocabulary)
The ~3,200 superword tokens improved fertility by 25% on Tamil, 19% on Malayalam, 18% on Myanmar, and 17% on Hindi and Thai compared to Stage 1 alone.
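The effect of superword tokens can be illustrated with a toy vocabulary: once a multi-word string such as "in order to" exists as a single token, a segmenter emits one token where subword-only tokenization needed three. The greedy longest-match routine below is a deliberate simplification for illustration, not the actual merge-based BPE algorithm:

```python
def segment(text, vocab):
    """Greedy longest-match segmentation over a fixed vocabulary.
    A toy stand-in for BPE, used only to show the effect of tokens
    that cross word boundaries."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # character fallback
            i += 1
    return tokens

subword_vocab = {"in", " order", " to", " learn"}
superword_vocab = subword_vocab | {"in order to"}  # one Stage-2 superword

print(segment("in order to learn", subword_vocab))
# → ['in', ' order', ' to', ' learn']
print(segment("in order to learn", superword_vocab))
# → ['in order to', ' learn']
```

High-fertility analytic languages benefit most because common multi-word collocations collapse into single tokens, which is consistent with the Tamil, Malayalam, and Myanmar gains above.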
## Design Innovations
- SuperBPE two-stage training — first open multilingual SuperBPE tokenizer
- √-proportional language weighting with 0.3% floor per language — ensures every script family gets minimum representation
- 71 languages, 26 scripts in a 32K vocabulary — parameter-efficient for small models
- Single-digit splitting — each digit tokenized individually for arithmetic reasoning (Singh & Strouse, ICLR 2025)
- 85 special tokens including instruct markers, language tags, reasoning markers, and tool-use tokens — future-proofed for instruction tuning
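The √-proportional weighting with a floor can be sketched as follows. The corpus sizes are made up, and the floor-then-renormalize scheme is an assumption about the recipe (the released tokenizer's exact procedure may differ in detail):

```python
import math

# Hypothetical corpus sizes in bytes (illustrative only; not the real mix).
corpus_bytes = {"eng": 2.0e10, "deu": 8.0e9, "tam": 4.0e8, "div": 1.0e5}

FLOOR = 0.003  # 0.3% minimum sampling share per language

# Square-root scaling dampens the dominance of high-resource languages.
raw = {lang: math.sqrt(n) for lang, n in corpus_bytes.items()}
total = sum(raw.values())
weights = {lang: w / total for lang, w in raw.items()}

# Raise any language below the floor, then renormalize to sum to 1.
# (After renormalization a floored weight can sit marginally below
# 0.3%; the exact handling in the released recipe is an assumption.)
weights = {lang: max(w, FLOOR) for lang, w in weights.items()}
norm = sum(weights.values())
weights = {lang: w / norm for lang, w in weights.items()}

for lang, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {w:.4f}")
```

Without the floor, a tiny corpus like the hypothetical `div` entry would receive a vanishing share and its script could end up represented only via byte fallback.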
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|padding\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|endoftext\|>` | 2 | End of text / EOS |
| `<\|system\|>` | 5 | System prompt |
| `<\|user\|>` | 6 | User turn |
| `<\|assistant\|>` | 7 | Assistant turn |
| `<\|thinking\|>` | 10 | Reasoning start |
| `<\|lang:XX\|>` | 14–84 | Language tags (71 languages) |
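The conversational tokens suggest a simple chat layout. This card does not publish an official chat template, so the turn structure below is only an illustrative assumption, and `build_prompt` is a hypothetical helper; the markers must be encoded with special-token handling enabled so each maps to its single ID.

```python
def build_prompt(system: str, user: str) -> str:
    # Hypothetical turn layout built from the documented special tokens;
    # the official chat template (if one is released) may differ.
    return (
        "<|bos|>"
        f"<|system|>{system}"
        f"<|user|>{user}"
        "<|assistant|>"  # generation continues after the assistant marker
    )

prompt = build_prompt("You are a concise assistant.", "What is BPE?")
print(prompt)
# → <|bos|><|system|>You are a concise assistant.<|user|>What is BPE?<|assistant|>
```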
## Usage

```python
from tokenizers import Tokenizer

# Load from a local file...
tok = Tokenizer.from_file("tokenizer.json")
# ...or directly from the Hugging Face Hub
tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

encoded = tok.encode("The history of mathematics began in ancient civilizations.")
print(encoded.tokens)
print(encoded.ids)

# Multilingual
encoded_ja = tok.encode("日本の歴史は縄文時代から始まり")  # "Japan's history begins in the Jōmon period"
encoded_ta = tok.encode("இந்தியா தெற்காசியாவில் அமைந்துள்ள ஒரு நாடு")  # "India is a country located in South Asia"
encoded_ar = tok.encode("تأسست الدولة العباسية في عام سبعمائة")  # "The Abbasid state was founded in the year seven hundred"
```
## Languages (71)

- **Tier 1 — Primary:** English, German, Japanese, French, Spanish, Russian, Italian, Chinese, Hebrew, Portuguese, Korean
- **Tier 2 — Important:** Arabic, Dutch, Polish, Ukrainian, Tamil, Czech, Telugu, Thai, Persian, Bengali, Hungarian, Hindi, Malayalam, Swedish, Greek, Finnish, Indonesian, Vietnamese
- **Tier 3 — Coverage:** Basque, Norwegian, Romanian, Serbian, Turkish, Bulgarian, Danish, Galician, Georgian, Marathi, Punjabi, Slovenian, Estonian, Croatian, Kannada, Myanmar, Slovak, Urdu, Afrikaans, Lithuanian, Latvian, Nepali, Odia, Sinhala, Albanian, Yiddish
- **Tier 4 — Minimal:** Amharic, Tibetan, Breton, Catalan, Welsh, Dhivehi, Irish, Scots Gaelic, Gujarati, Icelandic, Khmer, Latin, Macedonian, Swahili, Tagalog
## Scripts (26)
Latin, Cyrillic, Han (Simplified/Traditional), Hiragana/Katakana, Hangul, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Thai, Malayalam, Kannada, Gujarati, Gurmukhi, Myanmar, Khmer, Tibetan, Sinhala, Odia, Georgian, Armenian, Ethiopic, Thaana, Greek
## Training Details
- Algorithm: SuperBPE (two-stage byte-level BPE)
- Pre-tokenization: LLaMA-style regex with single-digit splitting (Stage 1), sentence-boundary-only splitting (Stage 2)
- SuperBPE transition: 90% subword → 10% superword
- Training data: Balanced multilingual Wikipedia (71 languages) + Stack Exchange + Code, processed by wiki_ultra_clean v7.2
- Language weighting: √-proportional with 0.3% minimum floor per language
- Normalization: None (lossless round-trip encoding)
- Byte fallback: Full 256-byte coverage via ByteLevel encoding
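The single-digit rule can be approximated with one regex pass ahead of BPE. This is a simplified stand-in for the full LLaMA-style pre-tokenizer mentioned above, which additionally handles contractions, punctuation classes, and Unicode categories:

```python
import re

# Digits match one at a time; other non-space runs and whitespace runs
# stay whole. A rough approximation of single-digit splitting.
PATTERN = re.compile(r"\d|[^\d\s]+|\s+")

def pre_tokenize(text: str) -> list[str]:
    return PATTERN.findall(text)

print(pre_tokenize("pi is 314 to 2 digits"))
# → ['pi', ' ', 'is', ' ', '3', '1', '4', ' ', 'to', ' ', '2', ' ', 'digits']
```

Because "314" can never become one token, numbers of any length decompose into a predictable digit sequence, which is the property the Singh & Strouse result ties to arithmetic reasoning.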
## Embedding Parameter Savings
| Model Scale | QT V.3 32K | QT V.2 96K | Llama 3 128K | V.3 Savings |
|---|---|---|---|---|
| d=1024 (Prelude) | 33M | 98M | 131M | 65M fewer |
| d=2048 (1B) | 66M | 197M | 263M | 131M fewer |
| d=4096 (7B) | 131M | 393M | 525M | 262M fewer |
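The table's counts follow directly from vocab_size × d_model for a single embedding matrix (assuming a tied LM head; an untied output head would roughly double each figure). A quick check of the d=1024 column:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    # One input embedding matrix; a tied output head reuses these weights.
    return vocab_size * d_model

for vocab in (32_000, 96_000, 128_256):
    print(f"{vocab}: {embedding_params(vocab, 1024) / 1e6:.0f}M")
# → 32000: 33M, 96000: 98M, 128256: 131M
```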
Those saved parameters fund additional transformer layers where they contribute to reasoning capability rather than sitting in an underutilised embedding table.
## References
- Liu et al. (2025) "SuperBPE: Space Travel for Language Models" — COLM 2025
- Tao et al. (2024) "Scaling Laws with Vocabulary" — NeurIPS 2024
- "The Art of Breaking Words" (2025) — arXiv 2508.06533 (iterative fertility balancing)
- "IndicSuperTokenizer" (2025) — arXiv 2511.03237 (two-stage subword+superword for Indic)
- "The Depth Delusion" (2026) — arXiv 2601.20994 (width > depth, 32K optimal for small models)
- Singh & Strouse (2025) "Tokenization Counts" — ICLR 2025 (single-digit splitting)
## Part of the Quartz Tokenizer Family
| Tokenizer | Vocab | Target | Status |
|---|---|---|---|
| QT V.2 64K | 64,000 | General multilingual | Released |
| QT V.2 96K | 96,000 | Extended multilingual | Released |
| QT V.2 Code 114K | 114,000 | Code + multilingual | Released |
| QT V.3 32K UltraLingo | 32,000 | Parameter-efficient SuperBPE | New |
Built by Quartz Data Infrastructure for the AENEA model family.
QT V.3 UltraLingo: Fewer tokens. More meaning. Every language.