QT V.4.1 64K UltraLingo: SuperBPE Tokenizer
Quartz Data Infrastructure (quartz.host) | AENEA Global (aeneaglobal.com)
A 64,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Overture model series (500M–2B parameters). Part of the QuartzTokenizer (QT) family.
Key Results
Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):
| Metric | QT V.4.1 64K | Llama 3 (128K) |
|---|---|---|
| Vocabulary size | 64,000 | 128,256 |
| Mean fertility (tokens/word) | 3.917 | 5.716 |
| Median fertility | 2.593 | 2.700 |
| Equity ratio (max/min fertility) | 32.3x | 118.6x |
| Total tokens (204 langs) | 12,979,330 | 16,764,198 |
| Languages won (head-to-head) | 126/204 | 78/204 |
| Token savings | -22.6% | baseline |
QT V.4.1 64K achieves lower mean fertility with half the vocabulary, 3.7x better cross-lingual equity, and 22.6% fewer total tokens than Llama 3.
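The two headline ratios can be recomputed from the table. The short check below uses only the figures reported above; the variable names are illustrative.

```python
# Figures copied from the Key Results table.
qt_tokens = 12_979_330
llama_tokens = 16_764_198
qt_equity = 32.3
llama_equity = 118.6

# Relative reduction in total tokens versus the Llama 3 baseline.
token_savings = 1 - qt_tokens / llama_tokens
# How many times smaller the max/min fertility spread is.
equity_gain = llama_equity / qt_equity

print(f"Token savings: {token_savings:.1%}")
print(f"Equity improvement: {equity_gain:.1f}x")
```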
Architecture
QT V.4.1 is a two-stage SuperBPE tokenizer with three innovations over standard BPE:
1. Two-Stage SuperBPE Training
- Stage 1 (57,600 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, and character sequences.
- Stage 2 (6,400 tokens, 10%): SuperBPE lifts the whitespace boundary constraint so that merges can span word boundaries. Learns high-frequency multi-word superword tokens (e.g., `of the`, `in order to`). Sentence boundary protection prevents cross-sentence tokens.
Based on Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models" (reported gains: +4.0% downstream, +8.2% MMLU, -27% inference compute).
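The difference between the two stages comes down to which adjacent pairs are eligible for merging. The toy sketch below is not the QT training code: it only counts candidate pairs, and it simplifies Stage 2 by treating whole words as the already-learned tokens. The function name and sample text are assumptions made for illustration.

```python
from collections import Counter

def count_pairs(units):
    """Count adjacent symbol pairs inside each unit; pairs never cross unit edges."""
    pairs = Counter()
    for symbols in units:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs

text = "in order to win. in order to learn."
# Split into sentences first, then into words.
sentences = [s.split() for s in text.split(".") if s.strip()]

# Stage 1 view: each word is a separate unit of characters,
# so candidate merges stay inside whitespace-delimited words.
stage1_pairs = count_pairs([list(word) for words in sentences for word in words])

# Stage 2 view: each sentence is one unit of word-level tokens (the leading
# space is kept on non-initial words), so merges may join words but
# cannot cross a sentence boundary.
stage2_units = [
    [w if i == 0 else " " + w for i, w in enumerate(words)]
    for words in sentences
]
stage2_pairs = count_pairs(stage2_units)

print(stage2_pairs.most_common(2))
```

In this sample the pair `("in", " order")` occurs twice and becomes a superword candidate, while no pair links `" win"` to the `"in"` that opens the next sentence.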
2. Script-Aware Pre-Tokenization (Indic Only)
- Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
- Preserves conjunct consonants by not breaking across virama (halant) marks
- CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
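A minimal sketch of virama-aware segmentation follows. It assumes a simple rule (the actual QT pre-tokenizer is not shown here): combining marks attach to the preceding cluster, and the character that follows a virama stays in the same cluster. Viramas are detected through Unicode canonical combining class 9. Edge cases such as ZWJ/ZWNJ are ignored.

```python
import unicodedata

VIRAMA_CCC = 9  # Unicode canonical combining class assigned to virama signs

def segment_clusters(text):
    """Group characters so that conjuncts joined by a virama are never split."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = (
            bool(clusters)
            and unicodedata.combining(clusters[-1][-1]) == VIRAMA_CCC
        )
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# Devanagari "kshatriya": ka + virama + ssa + ta + virama + ra + vowel sign i + ya
word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
print(segment_clusters(word))
```

For this word the sketch yields three clusters, keeping the conjuncts `क्ष` and `त्रि` intact instead of splitting at each virama.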
3. Streaming Sharded Training
- Corpus sharded to disk for RAM-bounded training
- Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
- Enables SuperBPE training on consumer hardware (16 GB RAM)
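The sharding idea can be sketched as follows: write the corpus to fixed-size files once, then stream lines back lazily so memory use is bounded by one line, not by the corpus. The function names, file naming scheme, and the tiny shard size are assumptions for illustration, not the QT pipeline.

```python
import os
import tempfile

def write_shards(lines, shard_dir, max_bytes):
    """Write lines to numbered shard files, opening a new file once max_bytes is reached."""
    paths, out, written = [], None, 0
    for line in lines:
        data = line.rstrip("\n") + "\n"
        if out is None or written >= max_bytes:
            if out is not None:
                out.close()
            path = os.path.join(shard_dir, f"shard_{len(paths):05d}.txt")
            out = open(path, "w", encoding="utf-8")
            paths.append(path)
            written = 0
        out.write(data)
        written += len(data.encode("utf-8"))
    if out is not None:
        out.close()
    return paths

def iter_shards(paths):
    """Yield lines one at a time so only a single line is held in memory."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

# Toy corpus; a generator, so it is never fully materialised before sharding.
corpus = (f"sentence number {i}" for i in range(1000))
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(corpus, tmp, max_bytes=4096)
    recovered = list(iter_shards(shard_paths))

print(len(shard_paths), len(recovered))
```

An iterator like `iter_shards` could be passed to any trainer that accepts a stream of text, and each stage could subsample it at its own ratio.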
Training Data
Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):
| Category | Share | Description |
|---|---|---|
| Wikipedia | 70.7% | 72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language |
| Stack Exchange | 21.7% | English reasoning, STEM, humanities, multilingual Q&A |
| Code | 8.0% | Python, JavaScript, Java, C/C++, Go/Rust, Shell |
Corpus design follows the iterative fertility balancing of "The Art of Breaking Words" (arXiv 2508.06533) and the script/family bucket approach of "One Tokenizer to Rule Them All".
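One way to read "sqrt-proportional sampling with 0.3% floor" is sketched below: each language's share is proportional to the square root of its corpus size, any share below the floor is raised to the floor, and the remaining mass is redistributed among the other languages. This is an interpretation, not the confirmed procedure, and the corpus sizes are made up.

```python
import math

def sqrt_shares_with_floor(sizes, floor=0.003):
    """Return per-language sampling shares that sum to 1 and never fall below floor."""
    weights = {lang: math.sqrt(n) for lang, n in sizes.items()}
    floored = set()
    while True:
        # Mass left for languages that are not pinned to the floor.
        free_mass = 1.0 - floor * len(floored)
        total = sum(w for lang, w in weights.items() if lang not in floored)
        shares = {
            lang: floor if lang in floored else free_mass * w / total
            for lang, w in weights.items()
        }
        newly_floored = {
            lang for lang, s in shares.items()
            if lang not in floored and s < floor
        }
        if not newly_floored:
            return shares
        floored |= newly_floored

# Hypothetical corpus sizes (arbitrary units), not the real QT corpus.
sizes = {"en": 1_000_000, "de": 250_000, "lo": 1}
shares = sqrt_shares_with_floor(sizes)
print({lang: round(s, 4) for lang, s in shares.items()})
```

With these sizes English is 4x larger than German but receives only 2x the share, and the tiny Lao corpus is lifted to the 0.3% floor.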
Per-Script Performance
FLORES-200 benchmark, mean tokens per word (lower is better):
| Script | QT V.4.1 64K | Llama 3 (128K) | Languages |
|---|---|---|---|
| Latin | 2.29 | 2.39 | 37 |
| Arabic | 2.10 | 2.70 | 2 |
| Cyrillic | 2.47 | 2.59 | 5 |
| Devanagari | 2.58 | 3.52 | 3 |
| Hebrew | 2.45 | 5.76 | 1 |
| Gurmukhi | 2.35 | 8.23 | 1 |
| Armenian | 2.86 | 12.23 | 1 |
| Bengali | 2.95 | 8.07 | 1 |
| Sinhala | 3.00 | 11.37 | 1 |
| Tamil | 3.16 | 12.45 | 1 |
| Odia | 3.25 | 16.90 | 1 |
| Gujarati | 3.26 | 10.02 | 1 |
| Georgian | 3.65 | 15.47 | 1 |
| Telugu | 3.71 | 13.36 | 1 |
| Kannada | 3.76 | 15.01 | 1 |
| Ethiopic | 3.77 | 11.95 | 1 |
| Malayalam | 4.00 | 16.33 | 1 |
| Myanmar | 6.05 | 29.77 | 1 |
| Greek | 2.90 | 2.58 | 1 |
| Thai | 11.74 | 14.03 | 1 |
| Khmer | 13.29 | 40.91 | 1 |
| CJK | 18.80 | 19.75 | 4 |
| Tibetan | 33.89 | 149.79 | 1 |
| Lao | 42.90 | 39.60 | 1 |
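The very high values for unsegmented scripts are easier to interpret with the metric written out. The sketch below assumes fertility is total tokens divided by whitespace-delimited words (the exact evaluation script is not shown here) and uses a stand-in character-level encoder so it runs without the tokenizer file.

```python
def fertility(sentences, encode):
    """Tokens per whitespace-delimited word over a list of sentences."""
    n_tokens = sum(len(encode(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

def char_encode(s):
    """Stand-in encoder for illustration: one token per non-space character."""
    return [c for c in s if not c.isspace()]

spaced = ["the cat sat"]    # 3 words, 9 characters
unspaced = ["thecatsat"]    # same characters, but a single "word"

print(fertility(spaced, char_encode))    # 3.0
print(fertility(unspaced, char_encode))  # 9.0
```

The same text scores three times worse once the spaces are removed, which is why scripts written without word spacing show inflated tokens-per-word figures for both tokenizers. To measure a real tokenizer, `encode` could be replaced with something like `lambda s: tok.encode(s).ids`.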
Special Tokens
14 structural tokens + 72 language tags = 86 special tokens total.
| ID | Token | Purpose |
|---|---|---|
| 0 | `<\|padding\|>` | Padding |
| 1 | `<\|bos\|>` | Beginning of sequence |
| 2 | `<\|endoftext\|>` | End of text |
| 3 | `<\|unk\|>` | Unknown |
| 4 | `<\|sep\|>` | Separator |
| 5 | `<\|system\|>` | System prompt |
| 6 | `<\|user\|>` | User turn |
| 7 | `<\|assistant\|>` | Assistant turn |
| 8 | `<\|tool_call\|>` | Tool invocation |
| 9 | `<\|tool_result\|>` | Tool response |
| 10 | `<\|thinking\|>` | Thinking open |
| 11 | `<\|/thinking\|>` | Thinking close |
| 12 | `<\|code\|>` | Code open |
| 13 | `<\|/code\|>` | Code close |
| 14–85 | `<\|lang:xx\|>` | Language tags (72 languages) |
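The ID layout can be expressed as a small lookup: structural tokens occupy IDs 0 to 13 and language tags follow from 14. In the sketch below the language codes and their order are hypothetical, since the tag order is not listed here.

```python
# Structural tokens in ID order, as listed in the table above.
STRUCTURAL = [
    "<|padding|>", "<|bos|>", "<|endoftext|>", "<|unk|>", "<|sep|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool_call|>",
    "<|tool_result|>", "<|thinking|>", "<|/thinking|>", "<|code|>", "<|/code|>",
]

def build_special_vocab(lang_codes):
    """Map each special token to its ID: structural tokens first, then language tags."""
    tokens = STRUCTURAL + [f"<|lang:{code}|>" for code in lang_codes]
    return {token: idx for idx, token in enumerate(tokens)}

# Hypothetical codes and ordering, for illustration only.
special_vocab = build_special_vocab(["en", "fr", "hi"])
print(special_vocab["<|assistant|>"], special_vocab["<|lang:en|>"])
```

With all 72 language codes supplied, the last tag would land on ID 85, matching the 86-token total.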
Usage
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids) # Token IDs
print(encoded.tokens) # Token strings
# Decode
text = tok.decode(encoded.ids)
print(text)
Intended Use
QT V.4.1 64K is designed as the tokenizer for the AENEA Overture model series (500M–2B parameters). It is optimised for:
- Multilingual language modelling across 72 languages
- Cross-lingual transfer with equitable compression across scripts
- Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
- Mathematical and scientific text
- Instruction-following with dedicated chat tokens
Recommended Pairing
| Model Size | Tokenizer | Vocab |
|---|---|---|
| Sub-500M (Prelude series) | QT V.4.1 32K | 32,000 |
| 500M–2B (Overture series) | QT V.4.1 64K (this model) | 64,000 |
Training Configuration
Algorithm: SuperBPE (two-stage)
Stage 1 vocab: 57,600 (90%, subword with whitespace boundaries)
Stage 2 vocab: 6,400 (10%, superword, no whitespace constraint)
Min frequency: Stage 1: 2, Stage 2: 50
Sample ratio: Stage 1: 0.35, Stage 2: 0.08
Pre-tokenization: Script-aware (Indic virama segmentation)
Training mode: Streaming sharded (500 MB shards)
Seed: 42
Training time: ~111 minutes (RTX 4060, 16 GB RAM)
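For scripting, the settings above can be captured in a small config object with a sanity check on the vocabulary budget. The class and field names below are hypothetical; only the values come from the configuration listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the Training Configuration listing.
    stage1_vocab: int = 57_600
    stage2_vocab: int = 6_400
    stage1_min_freq: int = 2
    stage2_min_freq: int = 50
    stage1_sample_ratio: float = 0.35
    stage2_sample_ratio: float = 0.08
    shard_size_mb: int = 500
    seed: int = 42

    @property
    def total_vocab(self):
        """Combined vocabulary across both stages."""
        return self.stage1_vocab + self.stage2_vocab

    @property
    def stage2_fraction(self):
        """Share of the vocabulary reserved for superword tokens."""
        return self.stage2_vocab / self.total_vocab

cfg = TrainConfig()
print(cfg.total_vocab, cfg.stage2_fraction)
```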
Limitations
- Lao remains the weakest script (42.9 TPW) due to limited training data and absence of whitespace word boundaries. Lao Wikipedia is extremely small.
- Tibetan (33.9 TPW) has improved significantly over previous versions but is still high due to the lack of whitespace delimiters. Future versions will increase the Tibetan corpus weight.
- Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
- The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.
References
- Liu et al., COLM 2025: "SuperBPE: Space Travel for Language Models"
- arXiv 2511.03237: IndicSuperTokenizer, SOTA fertility on 22 Indic languages
- arXiv 2508.06533: "The Art of Breaking Words", iterative fertility-driven reweighting
- Tao et al., NeurIPS 2024: Scaling Laws with Vocabulary
- arXiv 2601.20994: "The Depth Delusion", width > depth, 32K optimal for sub-500M
- NeurIPS 2025 Workshop: "From Bias to Balance", balanced tokenizer datasets
- Arnett et al. 2025: Crosslingual Tokenizer Inequities
Citation
@misc{downey2026qt,
title={QT V.4.1 UltraLingo: A Streaming Script-Aware SuperBPE Tokenizer for Equitable Multilingual Language Modelling},
author={Downey, James},
year={2026},
publisher={AENEA Global Ltd},
url={https://huggingface.co/JamesQuartz/qt-v4.1-64k-ultralingo}
}
About
Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).
- Quartz: Open-source data pipelines and tokenizers (quartz.host)
- AENEA: Language model laboratory (aenea.app)
- Crassus: Institutional credit intelligence (crassus.info)