QT V.3 32K UltraLingo — SuperBPE Multilingual Tokenizer

The most equitable small-vocabulary multilingual tokenizer available.

A 32,000-token byte-level BPE tokenizer with SuperBPE two-stage training, covering 71 languages across 26 writing systems. Designed for parameter-efficient small language models (sub-500M parameters) in the AENEA model family.

Key Results (FLORES-200 Benchmark, 204 languages)

| Metric | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|---|---|---|---|
| Vocab size | 32,000 | 96,000 | 128,256 |
| Mean fertility | 4.354 | 3.942 | 5.716 |
| Median fertility | 2.792 | 2.574 | 2.700 |
| Equity ratio | 38.7× | 31.6× | 118.6× |
| Embedding params (d=1024) | 33M | 98M | 131M |
  • Beats Llama 3 (128K vocab) on 48/204 languages with ¼ of the vocabulary
  • Beats QT V.2 96K on 24/204 languages — particularly Indic and SE Asian scripts
  • Within 15% of QT V.2 96K on 145/204 languages despite ⅓ of the vocabulary
  • 3× better equity than Llama 3 (38.7× vs 118.6×)
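The two headline metrics can be sketched in a few lines. This assumes fertility is tokens emitted per whitespace-delimited word and the equity ratio is the spread between the worst- and best-served languages (max/min per-language mean fertility); the card does not spell out its exact definitions, so treat both as assumptions.

```python
# Sketch of the benchmark metrics (assumed definitions, see lead-in).

def fertility(n_tokens: int, n_words: int) -> float:
    """Tokens emitted per whitespace-delimited word (lower is better)."""
    return n_tokens / max(n_words, 1)

def equity_ratio(per_language_fertility: dict) -> float:
    """Spread between the worst- and best-served languages."""
    vals = per_language_fertility.values()
    return max(vals) / min(vals)

# Toy per-language numbers for illustration only:
scores = {"en": 1.9, "ta": 3.4, "my": 12.9}
print(equity_ratio(scores))  # ≈ 6.8 for this toy set
```

A lower equity ratio means no script family pays a dramatically higher token tax than the best-served one.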

Script Family Performance (tokens/word, lower is better)

| Script | QT V.3 32K | QT V.2 96K | Llama 3 128K |
|---|---|---|---|
| Latin | 1.92 | 1.63 | 1.72 |
| Cyrillic | 2.83 | 2.24 | 2.43 |
| CJK | 21.54 | 17.25 | 19.64 |
| Arabic | 2.63 | 2.15 | 2.34 |
| Indic | 3.41 | 3.94 | 9.15 |
| SE Asian | 12.91 | 13.29 | 28.24 |

QT V.3 32K outperforms tokenizers 3-4× its size on Indic languages (Tamil, Telugu, Hindi, Bengali, Myanmar) and SE Asian scripts, while remaining competitive on Latin and Cyrillic.

What is SuperBPE?

SuperBPE (Liu et al., COLM 2025) is a two-stage extension of BPE that allows tokens to span word boundaries:

  • Stage 1 (Subword): Standard BPE with whitespace boundaries — learns roots, affixes, morphemes (90% of vocabulary)
  • Stage 2 (Superword): Whitespace constraint lifted — learns multi-word expressions like "in order to", "as well as" (10% of vocabulary)

The ~3,200 superword tokens improved fertility by 25% on Tamil, 19% on Malayalam, 18% on Myanmar, and 17% on Hindi and Thai compared to Stage 1 alone.
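The difference between the two stages comes down to what the pre-tokenizer lets BPE merges cross. The regexes below are illustrative only, not the tokenizer's actual patterns: Stage 1 chunks are whitespace-bounded, so merges stay inside words; Stage 2 chunks span whole sentences, so merges can form superwords like "in order to".

```python
import re

# Illustrative pre-tokenization regimes (assumed regexes, see lead-in).

def stage1_chunks(text: str) -> list:
    # Whitespace-bounded pieces: BPE merges can never cross a space.
    return re.findall(r"\S+", text)

def stage2_chunks(text: str) -> list:
    # Sentence-boundary-only splitting: merges may span multiple words.
    return re.split(r"(?<=[.!?])\s+", text)

print(stage1_chunks("in order to win"))            # ['in', 'order', 'to', 'win']
print(stage2_chunks("First sentence. Second one."))
```

Running standard BPE over the Stage 2 chunks is what lets frequent multi-word expressions collapse into single tokens.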

Design Innovations

  1. SuperBPE two-stage training — first open multilingual SuperBPE tokenizer
  2. √-proportional language weighting with 0.3% floor per language — ensures every script family gets minimum representation
  3. 71 languages, 26 scripts in a 32K vocabulary — parameter-efficient for small models
  4. Single-digit splitting — each digit tokenized individually for arithmetic reasoning (Singh & Strouse, ICLR 2025)
  5. 85 special tokens including instruct markers, language tags, reasoning markers, and tool-use tokens — future-proofed for instruction tuning
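Innovation 4 is easy to see in isolation. The fragment below shows only the digit rule (the card's full pre-tokenizer is LLaMA-style; this reduced pattern is an assumption for illustration): each digit becomes its own chunk, so numbers are never merged into opaque multi-digit tokens.

```python
import re

# Illustrative digit-splitting fragment of the pre-tokenizer (assumed
# pattern, see lead-in): digits are isolated, everything else groups.
PRETOKEN = re.compile(r"\d|[^\d\s]+")

def pretokenize(text: str) -> list:
    return PRETOKEN.findall(text)

print(pretokenize("price: 1234"))  # ['price:', '1', '2', '3', '4']
```

Keeping each digit atomic gives the model a consistent place-value representation, which is what helps arithmetic reasoning.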

Special Tokens

A selection of the 85 special tokens:

| Token | ID | Purpose |
|---|---|---|
| `<\|padding\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|endoftext\|>` | 2 | End of text / EOS |
| `<\|system\|>` | 5 | System prompt |
| `<\|user\|>` | 6 | User turn |
| `<\|assistant\|>` | 7 | Assistant turn |
| `<\|thinking\|>` | 10 | Reasoning start |
| `<\|lang:XX\|>` | 14–84 | Language tags (71 languages) |
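A minimal sketch of how the turn markers might be assembled into a prompt. Only the token names come from the table above; the ordering and template itself are an assumption, not a documented chat format.

```python
# Hypothetical chat template using the card's special tokens (the
# actual instruction-tuning format is not specified on this card).
def build_prompt(system: str, user: str) -> str:
    return (
        f"<|bos|><|system|>{system}<|endoftext|>"
        f"<|user|>{user}<|endoftext|>"
        f"<|assistant|>"
    )

print(build_prompt("You are a helpful assistant.", "Hello!"))
```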

Usage

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# or
tok = Tokenizer.from_pretrained("JamesQuartz/QT_V.3_32K_UltraLingo")

encoded = tok.encode("The history of mathematics began in ancient civilizations.")
print(encoded.tokens)
print(encoded.ids)

# Multilingual
encoded_ja = tok.encode("日本の歴史は縄文時代から始まり")  # "Japan's history begins in the Jōmon period"
encoded_ta = tok.encode("இந்தியா தெற்காசியாவில் அமைந்துள்ள ஒரு நாடு")  # "India is a country located in South Asia"
encoded_ar = tok.encode("تأسست الدولة العباسية في عام سبعمائة")  # "The Abbasid state was founded in the year seven hundred"
```

Languages (71)

Tier 1 — Primary: English, German, Japanese, French, Spanish, Russian, Italian, Chinese, Hebrew, Portuguese, Korean

Tier 2 — Important: Arabic, Dutch, Polish, Ukrainian, Tamil, Czech, Telugu, Thai, Persian, Bengali, Hungarian, Hindi, Malayalam, Swedish, Greek, Finnish, Indonesian, Vietnamese

Tier 3 — Coverage: Basque, Norwegian, Romanian, Serbian, Turkish, Bulgarian, Danish, Galician, Georgian, Marathi, Punjabi, Slovenian, Estonian, Croatian, Kannada, Myanmar, Slovak, Urdu, Afrikaans, Lithuanian, Latvian, Nepali, Odia, Sinhala, Albanian, Yiddish

Tier 4 — Minimal: Amharic, Tibetan, Breton, Catalan, Welsh, Dhivehi, Irish, Scots Gaelic, Gujarati, Icelandic, Khmer, Latin, Macedonian, Swahili, Tagalog

Scripts (26)

Latin, Cyrillic, Han (Simplified/Traditional), Hiragana/Katakana, Hangul, Arabic, Hebrew, Devanagari, Bengali, Tamil, Telugu, Thai, Malayalam, Kannada, Gujarati, Gurmukhi, Myanmar, Khmer, Tibetan, Sinhala, Odia, Georgian, Armenian, Ethiopic, Thaana, Greek

Training Details

  • Algorithm: SuperBPE (two-stage byte-level BPE)
  • Pre-tokenization: LLaMA-style regex with single-digit splitting (Stage 1), sentence-boundary-only splitting (Stage 2)
  • SuperBPE transition: 90% subword → 10% superword
  • Training data: Balanced multilingual Wikipedia (71 languages) + Stack Exchange + Code, processed by wiki_ultra_clean v7.2
  • Language weighting: √-proportional with 0.3% minimum floor per language
  • Normalization: None (lossless round-trip encoding)
  • Byte fallback: Full 256-byte coverage via ByteLevel encoding
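The √-proportional weighting with a floor can be sketched as below. This assumes weights are renormalized after the floor is applied; the card does not spell out that step, so treat the renormalization as an assumption.

```python
import math

# Sketch of √-proportional language weighting with a 0.3% per-language
# floor (renormalization after flooring is an assumption, see lead-in).
def language_weights(corpus_sizes: dict, floor: float = 0.003) -> dict:
    raw = {lang: math.sqrt(n) for lang, n in corpus_sizes.items()}
    total = sum(raw.values())
    floored = {lang: max(v / total, floor) for lang, v in raw.items()}
    s = sum(floored.values())
    return {lang: v / s for lang, v in floored.items()}

# Toy corpus sizes (bytes), illustration only:
print(language_weights({"en": 1e9, "ta": 1e7, "my": 1e5}))
```

The square root compresses the gap between high- and low-resource languages, and the floor guarantees every script family contributes enough data to earn vocabulary slots.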

Embedding Parameter Savings

| Model scale | QT V.3 32K | QT V.2 96K | Llama 3 128K | V.3 savings (vs V.2 96K) |
|---|---|---|---|---|
| d=1024 (Prelude) | 33M | 98M | 131M | 65M fewer |
| d=2048 (1B) | 66M | 197M | 263M | 131M fewer |
| d=4096 (7B) | 131M | 393M | 525M | 262M fewer |

Those saved parameters fund additional transformer layers where they contribute to reasoning capability rather than sitting in an underutilised embedding table.
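The table's arithmetic is simply vocabulary size times embedding width (input embeddings only; output-head weight tying is not considered here):

```python
# Embedding parameter count: vocab_size × d_model (input embeddings only).
def embedding_params(vocab_size: int, d_model: int) -> int:
    return vocab_size * d_model

for name, vocab in [("QT V.3 32K", 32_000), ("QT V.2 96K", 96_000), ("Llama 3", 128_256)]:
    print(f"{name}: {embedding_params(vocab, 1024) / 1e6:.0f}M")  # 33M / 98M / 131M at d=1024
```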

References

  • Liu et al. (2025) "SuperBPE: Space Travel for Language Models" — COLM 2025
  • Tao et al. (2024) "Scaling Laws with Vocabulary" — NeurIPS 2024
  • "The Art of Breaking Words" (2025) — arXiv 2508.06533 (iterative fertility balancing)
  • "IndicSuperTokenizer" (2025) — arXiv 2511.03237 (two-stage subword+superword for Indic)
  • "The Depth Delusion" (2026) — arXiv 2601.20994 (width > depth, 32K optimal for small models)
  • Singh & Strouse (2025) "Tokenization Counts" — ICLR 2025 (single-digit splitting)

Part of the Quartz Tokenizer Family

| Tokenizer | Vocab | Target | Status |
|---|---|---|---|
| QT V.2 64K | 64,000 | General multilingual | Released |
| QT V.2 96K | 96,000 | Extended multilingual | Released |
| QT V.2 Code 114K | 114,000 | Code + multilingual | Released |
| QT V.3 32K UltraLingo | 32,000 | Parameter-efficient SuperBPE | New |

Built by Quartz Data Infrastructure for the AENEA model family.

QT V.3 UltraLingo: Fewer tokens. More meaning. Every language.
