# QT V.3 32K UltraLingo — SuperBPE Multilingual Tokenizer
The most equitable small-vocabulary multilingual tokenizer available.
A 32,000-token byte-level BPE tokenizer with SuperBPE two-stage training, covering 71 languages across 26 writing systems. Designed for parameter-efficient small language models (sub-500M parameters) in the AENEA model family.
## Key Results (FLORES-200 Benchmark, 204 languages)
| Metric | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|---|---|---|---|
| Vocab size | 32,000 | 96,000 | 128,256 |
| Mean fertility | 4.354 | 3.942 | 5.716 |
| Median fertility | 2.792 | 2.574 | 2.700 |
| Equity ratio | 38.7× | 31.6× | 118.6× |
| Embedding params (d=1024) | 33M | 98M | 131M |
- Beats Llama 3 (128K vocab) on 48/204 languages with ¼ of the vocabulary
- Beats QT V.2 96K on 24/204 languages — particularly Indic and SE Asian scripts
- Within 15% of QT V.2 96K on 145/204 languages despite ⅓ of the vocabulary
- 3× better equity than Llama 3 (38.7× vs 118.6×)
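Fertility is tokens per word on benchmark text; the equity ratio is read here as worst-case fertility divided by best-case fertility across languages (a plausible interpretation, since the card does not define it). A minimal sketch of how these summary metrics fall out of per-language measurements, using made-up values rather than the FLORES-200 numbers:

```python
import statistics

# Hypothetical per-language fertility values (tokens per word);
# the real table is computed over all 204 FLORES-200 languages.
fertility = {"eng": 1.4, "deu": 1.8, "tam": 3.4, "mya": 12.9, "zho": 21.5}

mean_fertility = statistics.mean(fertility.values())
median_fertility = statistics.median(fertility.values())
# Equity ratio: how much worse off the worst-served language is
# than the best-served one.
equity_ratio = max(fertility.values()) / min(fertility.values())

print(f"mean={mean_fertility:.3f} median={median_fertility:.3f} "
      f"equity={equity_ratio:.1f}x")
```

Note how a few high-fertility scripts (CJK, SE Asian) pull the mean well above the median, which is why the table reports both.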
## Script Family Performance (tokens/word, lower is better)
| Script | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|---|---|---|---|
| Latin | 1.92 | 1.63 | 1.72 |
| Cyrillic | 2.83 | 2.24 | 2.43 |
| CJK | 21.54 | 17.25 | 19.64 |
| Arabic | 2.63 | 2.15 | 2.34 |
| Indic | 3.41 | 3.94 | 9.15 |
| SE Asian | 12.91 | 13.29 | 28.24 |
QT V.3 32K outperforms tokenizers 3-4× its size on Indic languages (Tamil, Telugu, Hindi, Bengali, Myanmar) and SE Asian scripts, while remaining competitive on Latin and Cyrillic.
## What is SuperBPE?
SuperBPE (Liu et al., COLM 2025) is a two-stage extension of BPE that allows tokens to span across word boundaries:
- Stage 1 (Subword): Standard BPE with whitespace boundaries — learns roots, affixes, morphemes (90% of vocabulary)
- Stage 2 (Superword): Whitespace constraint lifted — learns multi-word expressions like "in order to", "as well as" (10% of vocabulary)
The ~3,200 superword tokens improved fertility by 25% on Tamil, 19% on Malayalam, 18% on Myanmar, and 17% on Hindi and Thai compared to Stage 1 alone.
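The effect of superword tokens can be illustrated with a toy vocabulary: once a multi-word string such as "in order to" exists as a single token, a segmenter emits one token where subword-only tokenization needed three. The greedy longest-match routine below is a deliberate simplification for illustration, not the actual merge-based BPE algorithm:

```python
def segment(text, vocab):
    """Greedy longest-match segmentation over a fixed vocabulary.
    A toy stand-in for BPE, used only to show the effect of tokens
    that cross word boundaries."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # character fallback
            i += 1
    return tokens

subword_vocab = {"in", " order", " to", " learn"}
superword_vocab = subword_vocab | {"in order to"}  # one Stage-2 superword

print(segment("in order to learn", subword_vocab))
# → ['in', ' order', ' to', ' learn']
print(segment("in order to learn", superword_vocab))
# → ['in order to', ' learn']
```

High-fertility analytic languages benefit most because common multi-word collocations collapse into single tokens, which is consistent with the Tamil, Malayalam, and Myanmar gains above.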
## Design Innovations
- SuperBPE two-stage training — first open multilingual SuperBPE tokenizer
- √-proportional language weighting with 0.3% floor per language — ensures every script family gets minimum representation
- 71 languages, 26 scripts in a 32K vocabulary — parameter-efficient for small models
- Single-digit splitting — each digit tokenized individually for arithmetic reasoning (Singh & Strouse, ICLR 2025)
- 85 special tokens including instruct markers, language tags, reasoning markers, and tool-use tokens — future-proofed for instruction tuning
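The √-proportional weighting with a floor can be sketched as follows. The corpus sizes are made up, and the floor-then-renormalize scheme is an assumption about the recipe (the released tokenizer's exact procedure may differ in detail):

```python
import math

# Hypothetical corpus sizes in bytes (illustrative only; not the real mix).
corpus_bytes = {"eng": 2.0e10, "deu": 8.0e9, "tam": 4.0e8, "div": 1.0e5}

FLOOR = 0.003  # 0.3% minimum sampling share per language

# Square-root scaling dampens the dominance of high-resource languages.
raw = {lang: math.sqrt(n) for lang, n in corpus_bytes.items()}
total = sum(raw.values())
weights = {lang: w / total for lang, w in raw.items()}

# Raise any language below the floor, then renormalize to sum to 1.
# (After renormalization a floored weight can sit marginally below
# 0.3%; the exact handling in the released recipe is an assumption.)
weights = {lang: max(w, FLOOR) for lang, w in weights.items()}
norm = sum(weights.values())
weights = {lang: w / norm for lang, w in weights.items()}

for lang, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {w:.4f}")
```

Without the floor, a tiny corpus like the hypothetical `div` entry would receive a vanishing share and its script could end up represented only via byte fallback.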
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|padding\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|endoftext\|>` | 2 | End of text / EOS |
| `<\|system\|>` | 5 | System prompt |
| `<\|user\|>` | 6 | User turn |
| `<\|assistant\|>` | 7 | Assistant turn |
| `<\|thinking\|>` | 10 | Reasoning start |
| `<\|lang:XX\|>` | 14–84 | Language tags (71 languages) |
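The conversational tokens suggest a simple chat layout. This card does not publish an official chat template, so the turn structure below is only an illustrative assumption, and `build_prompt` is a hypothetical helper; the markers must be encoded with special-token handling enabled so each maps to its single ID.

```python
def build_prompt(system: str, user: str) -> str:
    # Hypothetical turn layout built from the documented special tokens;
    # the official chat template (if one is released) may differ.
    return (
        "<|bos|>"
        f"<|system|>{system}"
        f"<|user|>{user}"
        "<|assistant|>"  # generation continues after the assistant marker
    )

prompt = build_prompt("You are a concise assistant.", "What is BPE?")
print(prompt)
# → <|bos|><|system|>You are a concise assistant.<|user|>What is BPE?<|assistant|>
```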
## Usage

```python
from tokenizers import Tokenizer

# Load from a local file...
tok = Tokenizer.from_file("tokenizer.json")
# ...or directly from the Hugging Face Hub
tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

encoded = tok.encode("The history of mathematics began in ancient civilizations.")
print(encoded.tokens)
print(encoded.ids)

# Multilingual
encoded_ja = tok.encode("日本の歴史は縄文時代から始まり")  # "Japan's history begins in the Jōmon period"
encoded_ta = tok.encode("இந்தியா தெற்காசியாவில் அமைந்துள்ள ஒரு நாடு")  # "India is a country located in South Asia"
encoded_ar = tok.encode("تأسست الدولة العباسية في عام سبعمائة")  # "The Abbasid state was founded in the year seven hundred"
```
## Languages (71)

- **Tier 1 — Primary:** English, German, Japanese, French, Spanish, Russian, Italian, Chinese, Hebrew, Portuguese, Korean
- **Tier 2 — Important:** Arabic, Dutch, Polish, Ukrainian, Tamil, Czech, Telugu, Thai, Persian, Bengali, Hungarian, Hindi, Malayalam, Swedish, Greek, Finnish, Indonesian, Vietnamese
- **Tier 3 — Coverage:** Basque, Norwegian, Romanian, Serbian, Turkish, Bulgarian, Danish, Galician, Georgian, Marathi, Punjabi, Slovenian, Estonian, Croatian, Kannada, Myanmar, Slovak, Urdu, Afrikaans, Lithuanian, Latvian, Nepali, Odia, Sinhala, Albanian, Yiddish
- **Tier 4 — Minimal:** Amharic, Tibetan, Breton, Catalan, Welsh, Dhivehi, Irish, Scots Gaelic, Gujarati, Icelandic, Khmer, Latin, Macedonian, Swahili, Tagalog
## Scripts (26)
Latin, Cyrillic, Han (Simplified/Traditional), Hiragana/Katakana, Hangul, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Thai, Malayalam, Kannada, Gujarati, Gurmukhi, Myanmar, Khmer, Tibetan, Sinhala, Odia, Georgian, Armenian, Ethiopic, Thaana, Greek
## Training Details
- Algorithm: SuperBPE (two-stage byte-level BPE)
- Pre-tokenization: LLaMA-style regex with single-digit splitting (Stage 1), sentence-boundary-only splitting (Stage 2)
- SuperBPE transition: 90% subword → 10% superword
- Training data: Balanced multilingual Wikipedia (71 languages) + Stack Exchange + Code, processed by wiki_ultra_clean v7.2
- Language weighting: √-proportional with 0.3% minimum floor per language
- Normalization: None (lossless round-trip encoding)
- Byte fallback: Full 256-byte coverage via ByteLevel encoding
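The single-digit rule can be approximated with one regex pass ahead of BPE. This is a simplified stand-in for the full LLaMA-style pre-tokenizer mentioned above, which additionally handles contractions, punctuation classes, and Unicode categories:

```python
import re

# Digits match one at a time; other non-space runs and whitespace runs
# stay whole. A rough approximation of single-digit splitting.
PATTERN = re.compile(r"\d|[^\d\s]+|\s+")

def pre_tokenize(text: str) -> list[str]:
    return PATTERN.findall(text)

print(pre_tokenize("pi is 314 to 2 digits"))
# → ['pi', ' ', 'is', ' ', '3', '1', '4', ' ', 'to', ' ', '2', ' ', 'digits']
```

Because "314" can never become one token, numbers of any length decompose into a predictable digit sequence, which is the property the Singh & Strouse result ties to arithmetic reasoning.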
## Embedding Parameter Savings
| Model Scale | QT V.3 32K | QT V.2 96K | Llama 3 128K | V.3 Savings |
|---|---|---|---|---|
| d=1024 (Prelude) | 33M | 98M | 131M | 65M fewer |
| d=2048 (1B) | 66M | 197M | 263M | 131M fewer |
| d=4096 (7B) | 131M | 393M | 525M | 262M fewer |
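The table's counts follow directly from vocab_size × d_model for a single embedding matrix (assuming a tied LM head; an untied output head would roughly double each figure). A quick check of the d=1024 column:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    # One input embedding matrix; a tied output head reuses these weights.
    return vocab_size * d_model

for vocab in (32_000, 96_000, 128_256):
    print(f"{vocab}: {embedding_params(vocab, 1024) / 1e6:.0f}M")
# → 32000: 33M, 96000: 98M, 128256: 131M
```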
Those saved parameters fund additional transformer layers where they contribute to reasoning capability rather than sitting in an underutilised embedding table.
## References
- Liu et al. (2025) "SuperBPE: Space Travel for Language Models" — COLM 2025
- Tao et al. (2024) "Scaling Laws with Vocabulary" — NeurIPS 2024
- "The Art of Breaking Words" (2025) — arXiv 2508.06533 (iterative fertility balancing)
- "IndicSuperTokenizer" (2025) — arXiv 2511.03237 (two-stage subword+superword for Indic)
- "The Depth Delusion" (2026) — arXiv 2601.20994 (width > depth, 32K optimal for small models)
- Singh & Strouse (2025) "Tokenization Counts" — ICLR 2025 (single-digit splitting)
## Part of the Quartz Tokenizer Family
| Tokenizer | Vocab | Target | Status |
|---|---|---|---|
| QT V.2 64K | 64,000 | General multilingual | Released |
| QT V.2 96K | 96,000 | Extended multilingual | Released |
| QT V.2 Code 114K | 114,000 | Code + multilingual | Released |
| QT V.3 32K UltraLingo | 32,000 | Parameter-efficient SuperBPE | New |
Built by Quartz Data Infrastructure for the AENEA model family.
QT V.3 UltraLingo: Fewer tokens. More meaning. Every language.