QuartzTokenizer V.4.6 32K Prelude
A 32,000-token multilingual SuperBPE tokenizer. The training corpus targets 72 languages across 27 scripts. V.4.6 resolves the Lao byte-fallback gap present in earlier releases and ships with a cleaner training corpus.
The 32K QuartzTokenizer variant is the tokenizer for the Prelude model family. QuartzTokenizer is part of the Quartz open-source division of AENEA Global Ltd - the division that publishes AENEA's open tokenizers, data pipelines, and open models.
- Quartz (open-source division): https://quartz.host/
- AENEA Global Ltd: https://aeneaglobal.com/
| Property | Value |
|---|---|
| Vocabulary size | 32,000 |
| Algorithm | SuperBPE (two-stage), byte-level BPE base |
| Languages targeted | 72 |
| Scripts targeted | 27 (all present; 26 with strong coverage - see Known limitations) |
| Special tokens | 86 (chat, tool-use, thinking, code, language tags) |
| Used by | Prelude model family |
What's new in V.4.6
- Lao is fixed. Earlier releases such as V.4.4 had zero Lao merges, so every Lao character fell back to its raw UTF-8 bytes (1.02 bytes/token on FLORES-200 devtest, effectively no compression). V.4.6 ships 239 Lao tokens, and Lao now compresses at 4.30 bytes/token on FLORES-200 devtest.
- Cleaner corpus. Wikipedia template/category residue (native-language category prefixes, nested-bullet list markup, infobox formulas) was removed from the training corpus, slightly improving compression equity across underserved scripts.
- Benchmarked with a health gate. Every release is now validated against FLORES-200 with explicit checks for byte-fallback collapse, zero script coverage, lossless roundtrip, and UNK emission.
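As a rough illustration of what such a gate can look like, the sketch below checks three of the four conditions (byte-fallback collapse, lossless roundtrip, UNK emission) for one language. It is not the release's actual gate: the function name `health_check`, the `min_bpt` threshold of 1.2, and the toy byte-level stand-in tokenizer are all assumptions made for the example. The zero-script-coverage check needs the vocabulary itself and is left out.

```python
from typing import Callable, Dict, List, Optional


def health_check(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    sentences: List[str],
    unk_id: Optional[int] = None,
    min_bpt: float = 1.2,  # assumed threshold, not taken from the release
) -> Dict[str, object]:
    """Run simple per-language health checks over a list of sentences."""
    total_bytes = 0
    total_tokens = 0
    roundtrip_ok = True
    unk_count = 0
    for sentence in sentences:
        ids = encode(sentence)
        total_bytes += len(sentence.encode("utf-8"))
        total_tokens += len(ids)
        # Lossless roundtrip: decoding must reproduce the input exactly.
        roundtrip_ok = roundtrip_ok and decode(ids) == sentence
        if unk_id is not None:
            unk_count += ids.count(unk_id)
    bpt = total_bytes / max(total_tokens, 1)
    return {
        "bytes_per_token": bpt,
        # A BPT close to 1.0 means almost every byte became its own token.
        "byte_fallback_collapse": bpt < min_bpt,
        "lossless_roundtrip": roundtrip_ok,
        "unk_emitted": unk_count > 0,
    }


# Toy stand-in tokenizer (hypothetical): one token per UTF-8 byte.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


report = health_check(byte_encode, byte_decode, ["\u0eaa\u0eb0\u0e9a\u0eb2\u0e8d\u0e94\u0eb5", "hello"])
print(report)
```

With a loaded tokenizer, the same function could be called with `encode=lambda s: tok.encode(s).ids` and `decode=tok.decode`; whether the roundtrip is exact there depends on the tokenizer's decoder settings.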
Intended use
Designed as the tokenizer for the Prelude model family - small-to-mid multilingual language models. Suitable for any byte-level BPE workflow that needs broad, equitable script coverage at a 32K budget. Lossless on all tested inputs thanks to the byte-level fallback.
How to use
Hugging Face repo: JamesQuartz/qt-V.4.6-32k
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
enc = tok.encode("Everyone has the right to liberty and security of person.")
print(enc.ids)
print(tok.decode(enc.ids))
With the Transformers wrapper:
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
Architecture
V.4.6 uses a two-stage SuperBPE process on a byte-level BPE base:
- Stage 1 trains with a standard word-boundary pre-tokenizer (28,800 merges).
- Stage 2 drops the word-boundary split and isolates text only at newlines, so part of the remaining budget can form cross-word "superword" tokens for common multi-word units. Final vocabulary: 32,000.
The byte-level base guarantees lossless encoding of any input: every byte is representable, so there is no true out-of-vocabulary case. Scripts the tokenizer has not learned merges for still encode correctly - just less efficiently.
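The two-stage idea can be sketched with a toy byte-level BPE trainer in pure Python. This is a minimal illustration under stated assumptions, not the training code used for V.4.6: the helper names (`apply_merge`, `learn_merges`, `train_two_stage`), the whitespace regex standing in for the word-boundary pre-tokenizer, and the tiny corpus are all hypothetical. Stage 1 counts pairs only inside words; stage 2 joins each line into one chunk and keeps learning, so merges can now span spaces.

```python
import re
from collections import Counter
from typing import List, Tuple

Merge = Tuple[bytes, bytes]


def apply_merge(chunk: List[bytes], pair: Merge) -> List[bytes]:
    """Replace every adjacent occurrence of `pair` in `chunk` with its concatenation."""
    out, i = [], 0
    while i < len(chunk):
        if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
            out.append(chunk[i] + chunk[i + 1])
            i += 2
        else:
            out.append(chunk[i])
            i += 1
    return out


def learn_merges(chunks: List[List[bytes]], n_merges: int) -> Tuple[List[Merge], List[List[bytes]]]:
    """Greedy BPE: repeatedly merge the most frequent adjacent pair within chunks."""
    merges: List[Merge] = []
    for _ in range(n_merges):
        counts: Counter = Counter()
        for chunk in chunks:
            counts.update(zip(chunk, chunk[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        merges.append(pair)
        chunks = [apply_merge(c, pair) for c in chunks]
    return merges, chunks


def train_two_stage(lines: List[str], n_stage1: int, n_stage2: int) -> Tuple[List[Merge], List[Merge]]:
    # Stage 1: every word (with its leading space) is a separate chunk,
    # so no merge can cross a word boundary.
    words_per_line = [re.findall(r" ?\S+", line) for line in lines]
    word_chunks = [
        [bytes([b]) for b in word.encode("utf-8")]
        for words in words_per_line
        for word in words
    ]
    stage1, word_chunks = learn_merges(word_chunks, n_stage1)

    # Stage 2: rejoin the words of each line into one chunk (lines stay
    # isolated from each other) and continue merging across word boundaries.
    line_chunks, pos = [], 0
    for words in words_per_line:
        joined = [sym for chunk in word_chunks[pos:pos + len(words)] for sym in chunk]
        line_chunks.append(joined)
        pos += len(words)
    stage2, _ = learn_merges(line_chunks, n_stage2)
    return stage1, stage2


corpus = ["of the people", "by the people", "for the people"] * 4
stage1, stage2 = train_two_stage(corpus, n_stage1=20, n_stage2=3)
print(stage2[0])
```

On this toy corpus, stage 1 stops once every word is a single token, and the first stage-2 merge joins two whole words into one superword token.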
Benchmark results (FLORES-200 devtest)
Measured on the FLORES-200 devtest split: 204 languages, 1,012 parallel sentences each. The primary metric is bytes per token (BPT), where higher means better compression. A BPT near 1.0 indicates raw byte fallback (no merges were learned for the script). BPT is reported instead of tokens-per-word because it is well-defined for scripts written without whitespace between words (Lao, Thai, CJK, Tibetan), where tokens-per-word is misleading.
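A small sketch of the two metrics shows why BPT is the safer choice. The function names and the sample strings are illustrative assumptions, and the token counts passed in are made up for the example rather than produced by QuartzTokenizer.

```python
def bytes_per_token(text: str, n_tokens: int) -> float:
    """UTF-8 bytes of the input divided by the number of tokens produced."""
    return len(text.encode("utf-8")) / n_tokens


def tokens_per_word(text: str, n_tokens: int) -> float:
    """Tokens divided by whitespace-delimited words."""
    return n_tokens / len(text.split())


thai = "สวัสดีครับ"            # written without spaces between words
english = "hello there friend"

# Whitespace splitting sees the whole Thai string as a single "word",
# so tokens-per-word just echoes the token count and says nothing useful.
print(len(thai.split()), len(english.split()))
print(tokens_per_word(thai, 6))

# BPT only needs byte and token counts, so it works for any script.
# Under pure byte fallback (one token per byte) it is exactly 1.0.
n_bytes = len(thai.encode("utf-8"))
print(bytes_per_token(thai, n_bytes))
```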
Summary (201 FLORES languages with usable coverage):
| Metric | Value |
|---|---|
| Mean compression | 3.09 bytes/token |
| Median compression | 2.63 bytes/token |
| Best-compressed | Tibetan, 7.93 bytes/token |
| Languages passing health gate | 201 / 201 with usable coverage (3 out-of-scope FLORES-200 languages excluded) |
Selected languages:
| Language | Script | Bytes/token | Vocab tokens |
|---|---|---|---|
| English | Latin | 3.11 | 9,011 |
| Chinese (Simp.) | Han | 2.96 | 6,606 |
| Japanese | Han + kana (Jpan) | 4.24 | 6,606 |
| Korean | Hangul | 3.68 | 6,606 |
| Russian | Cyrillic | 4.36 | 1,383 |
| Arabic | Arabic | 3.49 | 1,045 |
| Hebrew | Hebrew | 3.73 | 727 |
| Hindi | Devanagari | 6.01 | 1,246 |
| Tamil | Tamil | 6.63 | 501 |
| Thai | Thai | 6.76 | 953 |
| Lao | Lao | 4.30 | 239 |
| Tibetan | Tibetan | 7.93 | 1,849 |
| Burmese | Myanmar | 7.17 | 748 |
| Khmer | Khmer | 5.52 | 567 |
Lao fix, before/after:
| Version | Lao bytes/token | Lao vocab tokens | Status |
|---|---|---|---|
| V.4.4 32K | 1.02 | 0 | byte-fallback collapse |
| V.4.6 32K | 4.30 | 239 | fixed |
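Because token count scales inversely with bytes per token, the two figures in the table translate directly into sequence length. The short calculation below uses only the table values; the helper name `tokens_needed` and the 1,000-byte input size are arbitrary choices for the example.

```python
def tokens_needed(n_bytes: int, bytes_per_token: float) -> float:
    """Approximate token count for a text of n_bytes at a given compression rate."""
    return n_bytes / bytes_per_token


before = tokens_needed(1000, 1.02)  # V.4.4 32K, Lao
after = tokens_needed(1000, 4.30)   # V.4.6 32K, Lao
reduction = before / after

print(round(before), round(after), round(reduction, 2))
```

For the same Lao text, V.4.6 therefore needs roughly 4.2 times fewer tokens than V.4.4.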
Known limitations
All 27 targeted scripts are present in the vocabulary, and 26 have strong coverage (hundreds to thousands of tokens each). One targeted script and two out-of-scope scripts are under-resourced; all encode losslessly but inefficiently.
In-scope (targeted script, under-resourced) - priority fix for V.4.7:
| Language | Script | Bytes/token | Vocab tokens | Notes |
|---|---|---|---|---|
| Dhivehi (dv) | Thaana | ~1.49 | 6 | only diacritics, no base letters |
Thaana is one of the 27 targeted scripts but received almost no merges (6 diacritic tokens). Dhivehi is not in FLORES-200, so the gap was found by auditing the shipped vocabulary against the corpus language list rather than via the benchmark. Bringing Dhivehi up to usable coverage is the primary goal of V.4.7, and is the same class of fix as the Lao repair in this release.
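An audit of this kind can be approximated by counting vocabulary tokens per Unicode block. The sketch below is an assumption about how such a check might look, not the audit actually run for this release: the function names, the `min_tokens` threshold, and the toy vocabulary are hypothetical. It also assumes the vocabulary entries have already been decoded to plain text, since byte-level BPE vocabularies usually store tokens in a remapped byte alphabet.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Unicode block ranges for the scripts being audited.
SCRIPT_RANGES: Dict[str, Tuple[int, int]] = {
    "Lao": (0x0E80, 0x0EFF),
    "Thaana": (0x0780, 0x07BF),
}


def script_token_counts(vocab_strings: Iterable[str]) -> Counter:
    """Count tokens that contain at least one character from each script."""
    counts: Counter = Counter({name: 0 for name in SCRIPT_RANGES})
    for token in vocab_strings:
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if any(lo <= ord(ch) <= hi for ch in token):
                counts[name] += 1
    return counts


def flag_underresourced(counts: Counter, min_tokens: int) -> List[str]:
    """Return scripts whose token count falls below the (assumed) threshold."""
    return sorted(name for name, n in counts.items() if n < min_tokens)


# Toy decoded vocabulary (hypothetical): three Lao tokens, one Thaana
# diacritic, and one Latin token.
toy_vocab = ["\u0eaa\u0eb0", "\u0e9a\u0eb2\u0e8d", "\u0e94\u0eb5", "\u07a6", "hello"]

counts = script_token_counts(toy_vocab)
flagged = flag_underresourced(counts, min_tokens=2)
print(dict(counts), flagged)
```

Running the same count over every targeted script, rather than only those covered by a benchmark, is what surfaces gaps like Thaana.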
Out-of-scope (not in the 72-language target) - candidates for future coverage:
| Language | Script | Bytes/token |
|---|---|---|
| Tamasheq (taq_Tfng) | Tifinagh | ~1.04 |
| Central Atlas Tamazight (tzm_Tfng) | Tifinagh | ~1.04 |
| Santali (sat_Olck) | Ol Chiki | ~1.07 |
These three were never part of the V.4.6 language target; they surfaced because the FLORES-200 benchmark covers more languages than the tokenizer was trained for. They are reported here for transparency and are candidates for a later expansion, but were not a V.4.6 goal.
Special tokens
Reserved structural tokens include chat roles (<|system|>, <|user|>,
<|assistant|>), tool use (<|tool_call|>, <|tool_result|>), reasoning
(<|thinking|>, <|/thinking|>), code (<|code|>, <|/code|>), and a set of
language tags (<|lang:en|>, etc.), alongside the usual <|bos|>,
<|endoftext|>, <|padding|>, <|unk|>, <|sep|>.
Version history
- V.4.6 - Lao coverage at 239 tokens (up from 0 in V.4.4); corpus re-cleaned; health-gated.
- V.4.5 - First Lao coverage (201 tokens); Lao byte-fallback resolved.
- V.4.4 - 32K; Lao absent (byte-fallback).
- V.4.1 - 32K and 64K variants.
About
QuartzTokenizer is developed by Quartz, the open-source division of AENEA Global Ltd, which publishes AENEA's open tokenizers, data pipelines, and open models. The 32K QuartzTokenizer variant is the tokenizer for the Prelude model family.
License
Apache 2.0. The FLORES-200 benchmark data used for evaluation is licensed CC BY-SA 4.0 by its authors and is not redistributed here.