QT V.4.1 64K UltraLingo: SuperBPE Tokenizer

Quartz Data Infrastructure (quartz.host) | AENEA Global (aeneaglobal.com)

A 64,000-vocabulary multilingual BPE tokenizer covering 72 languages across 27 scripts, designed for the AENEA Overture model series (500M–2B parameters). Part of the QuartzTokenizer (QT) family.

Key Results

Benchmarked on FLORES-200 (204 languages, 1,012 parallel sentences each):

Metric                             QT V.4.1 64K   Llama 3 (128K)
Vocabulary size                    64,000         128,256
Mean fertility (tokens/word)       3.917          5.716
Median fertility                   2.593          2.700
Equity ratio (max/min fertility)   32.3x          118.6x
Total tokens (204 langs)           12,979,330     16,764,198
Languages won (head-to-head)       126/204        78/204
Token savings                      -22.6%         baseline

QT V.4.1 64K achieves lower mean fertility with half the vocabulary, a 3.7x lower equity ratio (32.3x vs 118.6x), and 22.6% fewer total tokens than Llama 3.
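
The summary figures follow from simple arithmetic on the table. A minimal sketch, assuming fertility means tokens divided by whitespace-delimited words (the exact word-splitting rule used for the benchmark is not stated here); the function names are illustrative only:

```python
def fertility(encode, sentences):
    """Tokens per whitespace-delimited word over a list of sentences.

    `encode` is any callable that maps a string to a list of tokens.
    """
    n_tokens = sum(len(encode(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def equity_ratio(per_language_fertility):
    """Highest per-language fertility divided by the lowest."""
    values = list(per_language_fertility.values())
    return max(values) / min(values)


def token_savings(ours, baseline):
    """Fractional reduction in total tokens relative to a baseline."""
    return 1 - ours / baseline


# Figures reported in the Key Results table.
savings = token_savings(12_979_330, 16_764_198)
equity_gain = 118.6 / 32.3
print(f"{savings:.1%} fewer tokens, {equity_gain:.1f}x lower equity ratio")
```

With the real tokenizer, `encode` could be something like `lambda s: tok.encode(s).tokens`.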

Architecture

QT V.4.1 is a two-stage SuperBPE tokenizer with three innovations over standard BPE:

1. Two-Stage SuperBPE Training

  • Stage 1 (57,600 tokens, 90%): Standard BPE with Llama 3 / GPT-4 style whitespace pre-tokenization. Learns subword units: roots, affixes, morphemes, and character sequences.
  • Stage 2 (6,400 tokens, 10%): SuperBPE. Lifts the whitespace boundary constraint so that merges can span word boundaries. Learns high-frequency multi-word superword tokens (e.g., "of the", "in order to"). Sentence-boundary protection prevents tokens from spanning sentences.

Based on Liu et al., COLM 2025, "SuperBPE: Space Travel for Language Models", which reports +4.0% average downstream accuracy, +8.2% on MMLU, and 27% less inference compute relative to a BPE baseline.
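
The two stages can be illustrated with a toy character-level BPE. This is a sketch of the idea only, not the trainer used for QT V.4.1: the function names are hypothetical, and it works on characters rather than bytes. Stage 1 treats each word as its own sequence, so merges cannot cross whitespace. Stage 2 rejoins each sentence into one sequence and continues merging.

```python
import re
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(seqs, num_merges, min_freq=1):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break
        seqs = [merge_pair(s, best) for s in seqs]
        merges.append(best)
    return seqs, merges


def train_two_stage(sentences, stage1_merges, stage2_merges):
    # Stage 1: every word (with its leading space) is a separate sequence,
    # so no merge can cross a word boundary.
    words = [[list(w) for w in re.findall(r" ?\S+", s)] for s in sentences]
    flat = [w for sent in words for w in sent]
    flat, merges1 = learn_merges(flat, stage1_merges)
    # Stage 2: rejoin each sentence into one sequence and keep merging.
    # Pairs may now span spaces, but sentences remain separate sequences,
    # so no token can span a sentence boundary.
    it = iter(flat)
    sents = [[tok for _ in sent for tok in next(it)] for sent in words]
    _, merges2 = learn_merges(sents, stage2_merges)
    return merges1, merges2


corpus = ["of the cat", "of the dog", "of the end"]
merges1, merges2 = train_two_stage(corpus, stage1_merges=50, stage2_merges=1)
print(merges2)
```

On this toy corpus the first Stage 2 merge joins "of" and " the" into a single superword token, while every Stage 1 token stays within one word.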

2. Script-Aware Pre-Tokenization (Indic Only)

  • Virama-aware character segmentation for Indic scripts (Devanagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala)
  • Preserves conjunct consonants by not breaking across virama (halant) marks
  • CJK, Thai, Lao, Khmer, Myanmar, and Tibetan are left as raw text to enable proper multi-character merge learning
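
One way to keep conjuncts intact is to group characters into clusters before BPE sees them. The sketch below is an assumption about how such segmentation could work, not the tokenizer's actual rule set: `segment_clusters` is a hypothetical helper, and it ignores ZWJ/ZWNJ handling. A combining mark joins the preceding cluster, and a letter that directly follows a virama joins it too.

```python
import unicodedata

# Virama (halant) code points for the ten Indic scripts listed above.
VIRAMAS = {
    "\u094D",  # Devanagari
    "\u09CD",  # Bengali
    "\u0A4D",  # Gurmukhi
    "\u0ACD",  # Gujarati
    "\u0B4D",  # Odia
    "\u0BCD",  # Tamil
    "\u0C4D",  # Telugu
    "\u0CCD",  # Kannada
    "\u0D4D",  # Malayalam
    "\u0DCA",  # Sinhala
}


def segment_clusters(text):
    """Split text into clusters without breaking across a virama."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        joins_previous = bool(clusters) and (
            # Vowel signs, viramas and other combining marks stay attached.
            category in ("Mn", "Mc")
            # A letter right after a virama continues the conjunct.
            or (clusters[-1][-1] in VIRAMAS and category == "Lo")
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Devanagari "kshatriya": KA + virama + SSA, TA + virama + RA + sign I, YA.
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
print(segment_clusters(word))
```

The example word splits into three clusters, with both conjuncts kept whole.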

3. Streaming Sharded Training

  • Corpus sharded to disk for RAM-bounded training
  • Separate sample ratios and minimum frequencies for Stage 1 and Stage 2
  • Enables SuperBPE training on consumer hardware (16 GB RAM)
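
A minimal sketch of the sharding idea, assuming one document per line and uniform random line sampling (the actual sampling scheme is not described here; `write_shards` and `stream_sample` are hypothetical names). Shards are written once, then each stage streams them with its own sample ratio, so memory use does not grow with corpus size.

```python
import os
import random
import tempfile


def write_shards(lines, out_dir, shard_bytes):
    """Stream lines into numbered shard files of roughly `shard_bytes` each."""
    paths, handle, size = [], None, 0
    for line in lines:
        if handle is None or size >= shard_bytes:
            if handle:
                handle.close()
            paths.append(os.path.join(out_dir, f"shard_{len(paths):05d}.txt"))
            handle, size = open(paths[-1], "w", encoding="utf-8"), 0
        handle.write(line + "\n")
        size += len(line.encode("utf-8")) + 1  # +1 for the newline
    if handle:
        handle.close()
    return paths


def stream_sample(paths, ratio, seed=42):
    """Yield a random `ratio` of lines, reading one line at a time."""
    rng = random.Random(seed)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if rng.random() < ratio:
                    yield line.rstrip("\n")


# Both stages read the same shards, each with its own sample ratio.
with tempfile.TemporaryDirectory() as tmp:
    corpus = (f"line {i}" for i in range(1000))
    shards = write_shards(corpus, tmp, shard_bytes=1024)
    stage1 = list(stream_sample(shards, ratio=0.35))
    stage2 = list(stream_sample(shards, ratio=0.08))
    everything = list(stream_sample(shards, ratio=1.0))
```

In a real run the generators would be passed straight to the trainer instead of being collected into lists.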

Training Data

Trained on a balanced multilingual corpus (~5 GB target, 0.35 effective sample ratio):

Category         Share   Description
Wikipedia        70.7%   72 languages, 27 scripts; sqrt-proportional sampling with 0.3% floor per language
Stack Exchange   21.7%   English reasoning, STEM, humanities, multilingual Q&A
Code             8.0%    Python, JavaScript, Java, C/C++, Go/Rust, Shell

Corpus design follows the iterative fertility balancing of "The Art of Breaking Words" (arXiv 2508.06533) and the script/family bucket approach of "One Tokenizer to Rule Them All".
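
Square-root-proportional sampling with a floor can be sketched as follows. How the floor is applied is not specified here, so this version makes an assumption: languages below the floor are pinned to it and the remaining mass is redistributed in proportion to the square root of corpus size. The sizes are made up for illustration.

```python
import math


def sampling_shares(sizes, floor=0.003):
    """Square-root-proportional shares with a per-language minimum.

    `sizes` maps a language code to its raw corpus size (any unit).
    """
    roots = {lang: math.sqrt(size) for lang, size in sizes.items()}
    floored = set()
    while True:
        # Mass left over once every floored language has its minimum.
        free_mass = 1.0 - floor * len(floored)
        total = sum(r for lang, r in roots.items() if lang not in floored)
        shares = {
            lang: floor if lang in floored else free_mass * r / total
            for lang, r in roots.items()
        }
        newly_low = {lang for lang, s in shares.items() if s < floor}
        if not newly_low:
            return shares
        floored |= newly_low


# Hypothetical sizes in bytes, chosen only to show the floor taking effect.
shares = sampling_shares({"en": 10_000_000_000, "de": 4_000_000_000, "lo": 1_000})
for lang, share in shares.items():
    print(f"{lang}: {share:.2%}")
```

The small language is lifted to the 0.3% floor, while the two large ones keep their square-root ratio to each other.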

Per-Script Performance

FLORES-200 benchmark β€” mean tokens per word (lower is better):

Script       QT V.4.1 64K   Llama 3 (128K)   Languages
Latin        2.29           2.39             37
Arabic       2.10           2.70             2
Cyrillic     2.47           2.59             5
Devanagari   2.58           3.52             3
Hebrew       2.45           5.76             1
Gurmukhi     2.35           8.23             1
Armenian     2.86           12.23            1
Bengali      2.95           8.07             1
Sinhala      3.00           11.37            1
Tamil        3.16           12.45            1
Odia         3.25           16.90            1
Gujarati     3.26           10.02            1
Georgian     3.65           15.47            1
Telugu       3.71           13.36            1
Kannada      3.76           15.01            1
Ethiopic     3.77           11.95            1
Malayalam    4.00           16.33            1
Myanmar      6.05           29.77            1
Greek        2.90           2.58             1
Thai         11.74          14.03            1
Khmer        13.29          40.91            1
CJK          18.80          19.75            4
Tibetan      33.89          149.79           1
Lao          42.90          39.60            1

Special Tokens

14 structural tokens + 72 language tags = 86 special tokens total.

ID      Token             Purpose
0       <|padding|>       Padding
1       <|bos|>           Beginning of sequence
2       <|endoftext|>     End of text
3       <|unk|>           Unknown
4       <|sep|>           Separator
5       <|system|>        System prompt
6       <|user|>          User turn
7       <|assistant|>     Assistant turn
8       <|tool_call|>     Tool invocation
9       <|tool_result|>   Tool response
10      <|thinking|>      Thinking open
11      <|/thinking|>     Thinking close
12      <|code|>          Code open
13      <|/code|>         Code close
14–85   <|lang:xx|>       Language tags (72 languages)
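
The ID layout in the table can be reproduced with a few lines of Python. This is an illustrative sketch: the 72 language codes and their order are not listed here, so placeholder codes stand in for them, and `special_token_ids` is a hypothetical helper.

```python
STRUCTURAL_TOKENS = [
    "<|padding|>", "<|bos|>", "<|endoftext|>", "<|unk|>", "<|sep|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool_call|>",
    "<|tool_result|>", "<|thinking|>", "<|/thinking|>", "<|code|>", "<|/code|>",
]


def special_token_ids(lang_codes):
    """Map each special token to its ID: structural tokens first, then tags."""
    lang_tags = [f"<|lang:{code}|>" for code in lang_codes]
    return {tok: i for i, tok in enumerate(STRUCTURAL_TOKENS + lang_tags)}


# Placeholder codes only; the real codes and their order may differ.
placeholder_codes = [f"l{i:02d}" for i in range(72)]
ids = special_token_ids(placeholder_codes)
print(len(ids), ids["<|bos|>"], ids["<|/code|>"])
```

With the shipped tokenizer loaded through the `tokenizers` library, `tok.token_to_id("<|bos|>")` can be used to check an ID against this table.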

Usage

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

# Encode
encoded = tok.encode("The history of the Roman Empire spans centuries.")
print(encoded.ids)      # Token IDs
print(encoded.tokens)   # Token strings

# Decode
text = tok.decode(encoded.ids)
print(text)

Intended Use

QT V.4.1 64K is designed as the tokenizer for the AENEA Overture model series (500M–2B parameters). It is optimised for:

  • Multilingual language modelling across 72 languages
  • Cross-lingual transfer with equitable compression across scripts
  • Code generation (Python, JavaScript, Java, C/C++, Go, Rust)
  • Mathematical and scientific text
  • Instruction-following with dedicated chat tokens

Recommended Pairing

Model Size                  Tokenizer                   Vocab
Sub-500M (Prelude series)   QT V.4.1 32K                32,000
500M–2B (Overture series)   QT V.4.1 64K (this model)   64,000

Training Configuration

Algorithm:            SuperBPE (two-stage)
Stage 1 vocab:        57,600 (90%, subword with whitespace boundaries)
Stage 2 vocab:        6,400 (10%, superword, no whitespace constraint)
Min frequency:        Stage 1: 2, Stage 2: 50
Sample ratio:         Stage 1: 0.35, Stage 2: 0.08
Pre-tokenization:     Script-aware (Indic virama segmentation)
Training mode:        Streaming sharded (500 MB shards)
Seed:                 42
Training time:        ~111 minutes (RTX 4060, 16 GB RAM)

Limitations

  • Lao remains the weakest script (42.9 TPW) due to limited training data and absence of whitespace word boundaries. Lao Wikipedia is extremely small.
  • Tibetan (33.9 TPW) has improved significantly over previous versions but is still high due to the lack of whitespace delimiters. Future versions will increase the Tibetan corpus weight.
  • Scripts without whitespace (Thai, Khmer, Lao, Tibetan, CJK) inherently require more tokens per word under BPE with whitespace pre-tokenization, because a whitespace-delimited "word" in these scripts often spans a whole phrase or sentence.
  • The tokenizer is trained for tokenization quality, not for any specific downstream task. Model performance depends on the language model trained on top.

References

  • Liu et al., COLM 2025. "SuperBPE: Space Travel for Language Models".
  • arXiv 2511.03237. IndicSuperTokenizer: SOTA fertility on 22 Indic languages.
  • arXiv 2508.06533. "The Art of Breaking Words": iterative fertility-driven reweighting.
  • Tao et al., NeurIPS 2024. Scaling Laws with Vocabulary.
  • arXiv 2601.20994. "The Depth Delusion": width > depth, 32K optimal for sub-500M.
  • NeurIPS 2025 Workshop. "From Bias to Balance": balanced tokenizer datasets.
  • Arnett et al. 2025. Crosslingual Tokenizer Inequities.

Citation

@misc{downey2026qt,
  title={QT V.4.1 UltraLingo: A Streaming Script-Aware SuperBPE Tokenizer for Equitable Multilingual Language Modelling},
  author={Downey, James},
  year={2026},
  publisher={AENEA Global Ltd},
  url={https://huggingface.co/JamesQuartz/qt-v4.1-64k-ultralingo}
}

About

Built by James Downey at AENEA Global Ltd (Company No. 16743851, Manchester).

  • Quartz: Open-source data pipelines and tokenizers (quartz.host)
  • AENEA: Language model laboratory (aenea.app)
  • Crassus: Institutional credit intelligence (crassus.info)