---
language:
- en
- de
- ja
- fr
- es
- ru
- it
- zh
- he
- pt
- ko
- ar
- nl
- pl
- uk
- ta
- cs
- te
- th
- fa
- bn
- hu
- hi
- sv
- el
- fi
- id
- vi
- hy
- ro
- 'no'
- sr
- tr
- bg
- da
- gl
- ka
- mr
- pa
- sl
- et
- hr
- kn
- my
- sk
- ur
- af
- lt
- lv
- ne
- or
- si
- sq
- yi
- am
- bo
- br
- ca
- cy
- dv
- eu
- ga
- gd
- gu
- is
- km
- la
- mk
- ml
- sw
- tl
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- multilingual
- superbpe
- bpe
- byte-level
- quartz
- aenea
- ultralingo
---

# QT V.3 32K UltraLingo — SuperBPE Multilingual Tokenizer

**The most equitable small-vocabulary multilingual tokenizer available.**

A 32,000-token byte-level BPE tokenizer with SuperBPE two-stage training, covering **71 languages across 26 writing systems**. Designed for parameter-efficient small language models (sub-500M parameters) in the [AENEA](https://aenea.app) model family.

## Key Results (FLORES-200 Benchmark, 204 languages)

| Metric | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|--------|-----------|------------|--------------|
| **Vocab size** | 32,000 | 96,000 | 128,256 |
| **Mean fertility** | 4.354 | 3.942 | 5.716 |
| **Median fertility** | 2.792 | 2.574 | 2.700 |
| **Equity ratio** | 38.7× | 31.6× | 118.6× |
| **Embedding params (d=1024)** | 33M | 98M | 131M |

- **Beats Llama 3 (128K vocab) on 48/204 languages** with ¼ of the vocabulary
- **Beats QT V.2 96K on 24/204 languages** — particularly Indic and SE Asian scripts
- **Within 15% of QT V.2 96K on 145/204 languages** despite ⅓ of the vocabulary
- **3× better equity** than Llama 3 (38.7× vs 118.6×)

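The card does not define these metrics explicitly. Assuming fertility means tokens per word on the benchmark text, and the equity ratio compares the worst-served language to the best-served one (both assumptions inferred from the tables below), the statistics reduce to:

```python
from statistics import mean, median

def fertility_stats(per_language):
    """per_language maps a language code to (num_tokens, num_words)
    for that language's slice of the benchmark corpus."""
    fert = {lang: toks / words for lang, (toks, words) in per_language.items()}
    return {
        "mean_fertility": mean(fert.values()),
        "median_fertility": median(fert.values()),
        # equity ratio: worst-served language vs best-served language
        "equity_ratio": max(fert.values()) / min(fert.values()),
    }

# Toy numbers, not the real FLORES-200 counts.
stats = fertility_stats({
    "en": (1300, 1000),  # 1.3 tokens/word
    "ta": (5200, 1000),  # 5.2 tokens/word
    "ja": (2600, 1000),  # 2.6 tokens/word
})
print(stats)
```

A lower equity ratio means the tokenizer's cost per word is spread more evenly across languages.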
## Script Family Performance (tokens/word, lower is better)

| Script | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|--------|-----------|------------|--------------|
| Latin | 1.92 | 1.63 | 1.72 |
| Cyrillic | 2.83 | 2.24 | 2.43 |
| CJK | 21.54 | 17.25 | 19.64 |
| Arabic | 2.63 | 2.15 | 2.34 |
| **Indic** | **3.41** | 3.94 | 9.15 |
| **SE Asian** | **12.91** | 13.29 | 28.24 |

QT V.3 32K **outperforms tokenizers 3-4× its size** on Indic languages (Tamil, Telugu, Hindi, Bengali, Myanmar) and SE Asian scripts, while remaining competitive on Latin and Cyrillic.

## What is SuperBPE?

SuperBPE ([Liu et al., COLM 2025](https://arxiv.org/abs/2503.13423)) is a two-stage extension of BPE that allows tokens to span word boundaries:

- **Stage 1 (subword):** standard BPE with whitespace boundaries — learns roots, affixes, and morphemes (90% of the vocabulary)
- **Stage 2 (superword):** the whitespace constraint is lifted — learns multi-word expressions such as "in order to" and "as well as" (10% of the vocabulary)

The ~3,200 superword tokens improved fertility by **25% on Tamil**, **19% on Malayalam**, **18% on Myanmar**, and **17% on Hindi and Thai** compared to Stage 1 alone.

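The two stages can be illustrated with a toy, frequency-only BPE in pure Python. This is a conceptual sketch, not the production trainer (which operates on bytes, not characters, and uses a proper stopping criterion):

```python
from collections import Counter

def merge(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair` in `seq`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def best_pair(seqs):
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))
    return counts.most_common(1)[0][0] if counts else None

def train_superbpe(text, n_subword, n_superword):
    merges = []
    # Stage 1: merges are confined to whitespace-delimited words.
    words = [list(w) for w in text.split()]
    for _ in range(n_subword):
        pair = best_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge(w, pair, pair[0] + pair[1]) for w in words]
    # Stage 2: lift the constraint -- flatten into one sequence with
    # explicit space symbols, so merges may cross word boundaries.
    seq = []
    for w in words:
        seq += [" "] + w
    seq = seq[1:]
    for _ in range(n_superword):
        pair = best_pair([seq])
        if pair is None:
            break
        merges.append(pair)
        seq = merge(seq, pair, pair[0] + pair[1])
    return merges, seq

merges, seq = train_superbpe("in order to win in order to lead", 3, 5)
print(merges)
```

Stage-1 merges never contain a space; once the constraint is lifted, the most frequent pairs immediately start bridging words, which is exactly where the superword vocabulary comes from.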
## Design Innovations

1. **SuperBPE two-stage training** — first open multilingual SuperBPE tokenizer
2. **√-proportional language weighting** with a 0.3% floor per language — ensures every script family gets minimum representation
3. **71 languages, 26 scripts** in a 32K vocabulary — parameter-efficient for small models
4. **Single-digit splitting** — each digit is tokenized individually to aid arithmetic reasoning ([Singh & Strouse, ICLR 2025](https://arxiv.org/abs/2305.14201))
5. **85 special tokens** including instruct markers, language tags, reasoning markers, and tool-use tokens — future-proofed for instruction tuning

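Item 2 can be sketched as follows. The card does not spell out how weights are renormalized after applying the floor, so this shows one plausible reading: a single lift-and-renormalize pass (which can leave floored languages marginally below 0.3%; a production version might iterate to convergence):

```python
import math

def language_weights(corpus_sizes, floor=0.003):
    """Sampling weights proportional to sqrt(corpus size), with every
    language lifted to at least `floor` before renormalizing."""
    raw = {k: math.sqrt(v) for k, v in corpus_sizes.items()}
    total = sum(raw.values())
    w = {k: v / total for k, v in raw.items()}
    # Lift tiny languages to the floor, then renormalize once.
    lifted = {k: max(v, floor) for k, v in w.items()}
    total = sum(lifted.values())
    return {k: v / total for k, v in lifted.items()}

# Hypothetical corpus sizes in bytes, not the real training mix.
weights = language_weights({"en": 1_000_000, "de": 250_000, "dv": 1})
print(weights)
```

The square root compresses the gap between high- and low-resource languages, and the floor guarantees that even the smallest corpus (here `dv`) contributes roughly 0.3% of the training stream.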
## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|padding\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|endoftext\|>` | 2 | End of text / EOS |
| `<\|system\|>` | 5 | System prompt |
| `<\|user\|>` | 6 | User turn |
| `<\|assistant\|>` | 7 | Assistant turn |
| `<\|thinking\|>` | 10 | Reasoning start |
| `<\|lang:XX\|>` | 14-84 | Language tags (71 languages) |

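No official chat template ships with this card. As a purely hypothetical illustration, the turn markers above could frame a conversation like this (the ordering and absence of separators are assumptions, not a documented format):

```python
def format_chat(system, turns):
    """Frame a conversation with the card's turn markers.
    The exact layout is an assumption -- this tokenizer publishes
    token IDs but no official chat template."""
    parts = ["<|bos|>", f"<|system|>{system}"]
    for role, text in turns:  # role is "user" or "assistant"
        parts.append(f"<|{role}|>{text}")
    parts.append("<|assistant|>")  # trailing marker cues the model to respond
    return "".join(parts)

prompt = format_chat("You are helpful.", [("user", "Hi!")])
print(prompt)
# <|bos|><|system|>You are helpful.<|user|>Hi!<|assistant|>
```

Whatever layout is chosen at fine-tuning time, the markers should be added as special tokens so they encode to their reserved IDs rather than being split by BPE.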
## Usage

```python
from tokenizers import Tokenizer

# Load from a local file
tok = Tokenizer.from_file("tokenizer.json")
# or pull directly from the Hub
tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

encoded = tok.encode("The history of mathematics began in ancient civilizations.")
print(encoded.tokens)
print(encoded.ids)

# Multilingual
encoded_ja = tok.encode("日本の歴史は縄文時代から始まり")  # Japanese: "Japan's history begins in the Jōmon period"
encoded_ta = tok.encode("இந்தியா தெற்காசியாவில் அமைந்துள்ள ஒரு நாடு")  # Tamil: "India is a country in South Asia"
encoded_ar = tok.encode("تأسست الدولة العباسية في عام سبعمائة")  # Arabic: "The Abbasid state was founded in the year seven hundred"
```

## Languages (71)

**Tier 1 — Primary:** English, German, Japanese, French, Spanish, Russian, Italian, Chinese, Hebrew, Portuguese, Korean

**Tier 2 — Important:** Arabic, Dutch, Polish, Ukrainian, Tamil, Czech, Telugu, Thai, Persian, Bengali, Hungarian, Hindi, Malayalam, Swedish, Greek, Finnish, Indonesian, Vietnamese

**Tier 3 — Coverage:** Basque, Norwegian, Romanian, Serbian, Turkish, Bulgarian, Danish, Galician, Georgian, Marathi, Punjabi, Slovenian, Estonian, Croatian, Kannada, Myanmar, Slovak, Urdu, Afrikaans, Lithuanian, Latvian, Nepali, Odia, Sinhala, Albanian, Yiddish

**Tier 4 — Minimal:** Amharic, Tibetan, Breton, Catalan, Welsh, Dhivehi, Irish, Scots Gaelic, Gujarati, Icelandic, Khmer, Latin, Macedonian, Swahili, Tagalog

## Scripts (26)

Latin, Cyrillic, Han (Simplified/Traditional), Hiragana/Katakana, Hangul, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Thai, Malayalam, Kannada, Gujarati, Gurmukhi, Myanmar, Khmer, Tibetan, Sinhala, Odia, Georgian, Armenian, Ethiopic, Thaana, Greek

## Training Details

- **Algorithm:** SuperBPE (two-stage byte-level BPE)
- **Pre-tokenization:** LLaMA-style regex with single-digit splitting (Stage 1), sentence-boundary-only splitting (Stage 2)
- **SuperBPE transition:** 90% subword → 10% superword
- **Training data:** Balanced multilingual Wikipedia (71 languages) + Stack Exchange + Code, processed by [wiki_ultra_clean v7.2](https://github.com/QuartzOpen/quartz-clean)
- **Language weighting:** √-proportional with a 0.3% minimum floor per language
- **Normalization:** None (lossless round-trip encoding)
- **Byte fallback:** Full 256-byte coverage via ByteLevel encoding

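The single-digit rule in the pre-tokenizer can be illustrated with a simplified regex. This is a toy pattern for demonstration; the production LLaMA-style pattern is more elaborate and is not reproduced in this card:

```python
import re

# Simplified pre-tokenization: an (optionally space-prefixed) run of
# letters, a single digit, a run of punctuation, or a whitespace char.
# Only the lone-\d alternative matters here: it forces "1999" apart.
PRETOKEN = re.compile(r" ?[^\W\d_]+|\d| ?[^\w\s]+|\s", re.UNICODE)

pieces = PRETOKEN.findall("Founded in 1999.")
print(pieces)
# ['Founded', ' in', ' ', '1', '9', '9', '9', '.']
```

Because every digit becomes its own pre-token, BPE can never merge multi-digit chunks like "99" into single vocabulary entries, which keeps number representations uniform for arithmetic.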
## Embedding Parameter Savings

| Model Scale | QT V.3 32K | QT V.2 96K | Llama 3 128K | V.3 Savings |
|-------------|-----------|------------|--------------|-------------|
| d=1024 (Prelude) | 33M | 98M | 131M | **65M fewer** |
| d=2048 (1B) | 66M | 197M | 263M | **131M fewer** |
| d=4096 (7B) | 131M | 393M | 525M | **262M fewer** |

The saved parameters can instead fund additional transformer layers, where they contribute to reasoning capability rather than sitting in an underutilized embedding table.

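The table's figures follow directly from vocab_size × d_model for a single tied embedding table, rounded to the nearest million (the savings column subtracts the rounded values against QT V.2 96K):

```python
def embed_params(vocab_size, d_model):
    # One row of d_model weights per vocabulary entry (tied embedding;
    # untied input/output tables would double this).
    return vocab_size * d_model

for d in (1024, 2048, 4096):
    row = [embed_params(v, d) for v in (32_000, 96_000, 128_256)]
    print(f"d={d}:", " | ".join(f"{p / 1e6:.0f}M" for p in row))
# d=1024: 33M | 98M | 131M
```

At d=1024 the embedding table alone is the difference between a ~33M and a ~131M parameter budget line, which is why vocabulary size dominates small-model design.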
## References

- Liu et al. (2025) "[SuperBPE: Space Travel for Language Models](https://arxiv.org/abs/2503.13423)" — COLM 2025
- Tao et al. (2024) "[Scaling Laws with Vocabulary](https://proceedings.neurips.cc/paper_files/paper/2024/hash/cf5a019ae9c11b4be88213ce3f85d85c-Abstract-Conference.html)" — NeurIPS 2024
- "The Art of Breaking Words" (2025) — arXiv 2508.06533 (iterative fertility balancing)
- "IndicSuperTokenizer" (2025) — arXiv 2511.03237 (two-stage subword+superword for Indic)
- "The Depth Delusion" (2026) — arXiv 2601.20994 (width > depth, 32K optimal for small models)
- Singh & Strouse (2025) "[Tokenization Counts](https://arxiv.org/abs/2305.14201)" — ICLR 2025 (single-digit splitting)

## Part of the Quartz Tokenizer Family

| Tokenizer | Vocab | Target | Status |
|-----------|-------|--------|--------|
| QT V.2 64K | 64,000 | General multilingual | Released |
| QT V.2 96K | 96,000 | Extended multilingual | Released |
| QT V.2 Code 114K | 114,000 | Code + multilingual | Released |
| **QT V.3 32K UltraLingo** | **32,000** | **Parameter-efficient SuperBPE** | **New** |

---

*Built by [Quartz Data Infrastructure](https://quartz.host) for the [AENEA](https://aenea.app) model family.*

*QT V.3 UltraLingo: Fewer tokens. More meaning. Every language.*