QT V.4.1 64K UltraLingo — SuperBPE Tokenizer

Quartz Data Infrastructure — quartz.host | AENEA Global — aeneaglobal.com

A 64,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Overture model series (500M–2B parameters). Part of the QuartzTokenizer (QT) family.

Key Results

Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):

Metric	QT V.4.1 64K	Llama 3 (128K)
Vocabulary size	64,000	128,256
Mean fertility (tokens/word)	3.917	5.716
Median fertility	2.593	2.700
Equity ratio (max/min fertility)	32.3x	118.6x
Total tokens (204 langs)	12,979,330	16,764,198
Languages won (head-to-head)	126/204	78/204
Token savings	−22.6%	baseline

QT V.4.1 64K achieves lower mean fertility with half the vocabulary, 3.7x better cross-lingual equity, and 22.6% fewer total tokens than Llama 3.

Architecture

QT V.4.1 is a two-stage SuperBPE tokenizer with three innovations over standard BPE:

1. Two-Stage SuperBPE Training

Stage 1 (57,600 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units — roots, affixes, morphemes, character sequences.
Stage 2 (6,400 tokens, 10%): SuperBPE — lifts the whitespace boundary constraint, allowing merges to span across word boundaries. Learns high-frequency multi-word superword tokens (e.g., of the, in order to). Sentence boundary protection prevents cross-sentence tokens.

Based on Liu et al., COLM 2025 — "SuperBPE: Space Travel for Language Models" (+4.0% downstream, +8.2% MMLU, −27% inference compute).

2. Script-Aware Pre-Tokenization (Indic Only)

Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
Preserves conjunct consonants by not breaking across virama (halant) marks
CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning

3. Streaming Sharded Training

Corpus sharded to disk for RAM-bounded training
Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
Enables SuperBPE training on consumer hardware (16 GB RAM)

Training Data

Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):

Category	Share	Description
Wikipedia	70.7%	72 languages, 27 scripts — sqrt-proportional sampling with 0.3% floor per language
Stack Exchange	21.7%	English reasoning, STEM, humanities, multilingual Q&A
Code	8.0%	Python, JavaScript, Java, C/C++, Go/Rust, Shell

Corpus design follows "The Art of Breaking Words" (arXiv 2508.06533) iterative fertility balancing and "One Tokenizer to Rule Them All" script/family bucket approach.

Per-Script Performance

FLORES-200 benchmark — mean tokens per word (lower is better):

Script	QT V.4.1 64K	Llama 3 (128K)	Languages
Latin	2.29	2.39	37
Arabic	2.10	2.70	2
Cyrillic	2.47	2.59	5
Devanagari	2.58	3.52	3
Hebrew	2.45	5.76	1
Gurmukhi	2.35	8.23	1
Armenian	2.86	12.23	1
Bengali	2.95	8.07	1
Sinhala	3.00	11.37	1
Tamil	3.16	12.45	1
Odia	3.25	16.90	1
Gujarati	3.26	10.02	1
Georgian	3.65	15.47	1
Telugu	3.71	13.36	1
Kannada	3.76	15.01	1
Ethiopic	3.77	11.95	1
Malayalam	4.00	16.33	1
Myanmar	6.05	29.77	1
Greek	2.90	2.58	1
Thai	11.74	14.03	1
Khmer	13.29	40.91	1
CJK	18.80	19.75	4
Tibetan	33.89	149.79	1
Lao	42.90	39.60	1

Special Tokens

14 structural tokens + 72 language tags = 86 special tokens total.

ID	Token	Purpose
0	`<\|padding\|>`	Padding
1	`<\|bos\|>`	Beginning of sequence
2	`<\|endoftext\|>`	End of text
3	`<\|unk\|>`	Unknown
4	`<\|sep\|>`	Separator
5	`<\|system\|>`	System prompt
6	`<\|user\|>`	User turn
7	`<\|assistant\|>`	Assistant turn
8	`<\|tool_call\|>`	Tool invocation
9	`<\|tool_result\|>`	Tool response
10	`<\|thinking\|>`	Thinking open
11	`<\|/thinking\|>`	Thinking close
12	`<\|code\|>`	Code open
13	`<\|/code\|>`	Code close
14–85	`<\|lang:xx\|>`	Language tags (72 languages)

Usage

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids)      # Token IDs
print(encoded.tokens)   # Token strings

# Decode
text = tok.decode(encoded.ids)
print(text)

Intended Use

QT V.4.1 64K is designed as the tokenizer for the AENEA Overture model series (500M–2B parameters). It is optimised for:

Multilingual language modelling across 72 languages
Cross-lingual transfer with equitable compression across scripts
Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
Mathematical and scientific text
Instruction-following with dedicated chat tokens

Recommended Pairing

Model Size	Tokenizer	Vocab
Sub-500M (Prelude series)	QT V.4.1 32K	32,000
500M–2B (Overture series)	QT V.4.1 64K (this model)	64,000

Training Configuration

Algorithm:            SuperBPE (two-stage)
Stage 1 vocab:        57,600 (90% — subword with whitespace boundaries)
Stage 2 vocab:        6,400 (10% — superword, no whitespace constraint)
Min frequency:        Stage 1: 2, Stage 2: 50
Sample ratio:         Stage 1: 0.35, Stage 2: 0.08
Pre-tokenization:     Script-aware (Indic virama segmentation)
Training mode:        Streaming sharded (500 MB shards)
Seed:                 42
Training time:        ~111 minutes (RTX 4060, 16 GB RAM)

Limitations

Lao remains the weakest script (42.9 TPW) due to limited training data and absence of whitespace word boundaries. Lao Wikipedia is extremely small.
Tibetan (33.9 TPW) has improved significantly over previous versions but is still high due to the lack of whitespace delimiters. Future versions will increase the Tibetan corpus weight.
Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization.
The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.

References

Liu et al., COLM 2025 — "SuperBPE: Space Travel for Language Models"
arXiv 2511.03237 — IndicSuperTokenizer: SOTA fertility on 22 Indic languages
arXiv 2508.06533 — "The Art of Breaking Words": iterative fertility-driven reweighting
Tao et al., NeurIPS 2024 — Scaling Laws with Vocabulary
arXiv 2601.20994 — "The Depth Delusion": width > depth, 32K optimal for sub-500M
NeurIPS 2025 Workshop — "From Bias to Balance": balanced tokenizer datasets
Arnett et al. 2025 — Crosslingual Tokenizer Inequities

Citation

@misc{downey2026qt,
  title={QT V.4.1 UltraLingo: A Streaming Script-Aware SuperBPE Tokenizer for Equitable Multilingual Language Modelling},
  author={Downey, James},
  year={2026},
  publisher={AENEA Global Ltd},
  url={https://huggingface.co/JamesQuartz/qt-v4.1-64k-ultralingo}
}

About

Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).

Quartz — Open-source data pipelines and tokenizers (quartz.host)
AENEA — Language model laboratory (aenea.app)
Crassus — Institutional credit intelligence (crassus.info)

Downloads last month: -; Downloads are not tracked for this model. How to track