QuartzTokenizer V.4.6 32K Prelude

A 32,000-token multilingual SuperBPE tokenizer. The training corpus targets 72 languages across 27 scripts. V.4.6 resolves the Lao byte-fallback gap present in earlier releases and ships with a cleaner training corpus.

The 32K QuartzTokenizer variant is the tokenizer for the Prelude model family. QuartzTokenizer is part of the Quartz open-source division of AENEA Global Ltd - the division that publishes AENEA's open tokenizers, data pipelines, and open models.

Quartz (open-source division): https://quartz.host/
AENEA Global Ltd: https://aeneaglobal.com/


Vocabulary size	32,000
Algorithm	SuperBPE (two-stage), byte-level BPE base
Languages targeted	72
Scripts targeted	27 (all present; 26 with strong coverage - see Known limitations)
Special tokens	86 (chat, tool-use, thinking, code, language tags)
Used by	Prelude model family

What's new in V.4.6

Lao is fixed. V.4.4 had zero Lao merges - every Lao character fell back to its raw UTF-8 bytes (1.02 bytes/token, effectively no compression). V.4.6 ships 239 Lao tokens; Lao now compresses at 4.30 bytes/token on FLORES-200 devtest.
Cleaner corpus. Wikipedia template/category residue (native-language category prefixes, nested-bullet list markup, infobox formulas) was removed from the training corpus, slightly improving compression equity across underserved scripts.
Benchmarked with a health gate. Every release is now validated against FLORES-200 with explicit checks for byte-fallback collapse, zero script coverage, lossless roundtrip, and UNK emission.
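
The four checks can be sketched roughly as follows. This is a minimal illustration, not the release tooling: the `health_gate` function, the `COLLAPSE_BPT` threshold, the `UNK_ID` placeholder, and the stand-in byte tokenizer are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical threshold: a bytes-per-token value at or below this counts as collapse.
COLLAPSE_BPT = 1.1
# Placeholder id for <|unk|>; the real id is not stated in this card.
UNK_ID = -1


def health_gate(
    sentences: List[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    script_token_count: int,
) -> dict:
    """Run the four checks on one language's benchmark sentences."""
    total_bytes = 0
    total_tokens = 0
    lossless = True
    unk_free = True
    for s in sentences:
        ids = encode(s)
        total_bytes += len(s.encode("utf-8"))
        total_tokens += len(ids)
        lossless = lossless and decode(ids) == s
        unk_free = unk_free and UNK_ID not in ids
    bpt = total_bytes / total_tokens if total_tokens else 0.0
    return {
        "no_byte_fallback_collapse": bpt > COLLAPSE_BPT,
        "has_script_coverage": script_token_count > 0,
        "lossless_roundtrip": lossless,
        "no_unk": unk_free,
    }


# Stand-in tokenizer that only ever emits raw UTF-8 bytes (pure byte fallback).
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


# Three Lao code points, encoded by a tokenizer with no Lao merges at all.
report = health_gate(["\u0ea5\u0eb2\u0ea7"], byte_encode, byte_decode, script_token_count=0)
print(report)
```

With the stand-in byte tokenizer, the roundtrip and UNK checks pass while the collapse and coverage checks fail, which is the pattern described for Lao before the fix.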

Intended use

Designed as the tokenizer for the Prelude model family - small-to-mid multilingual language models. Suitable for any byte-level BPE workflow that needs broad, equitable script coverage at a 32K budget. Lossless on all tested inputs thanks to the byte-level fallback.

How to use

Hugging Face repo: JamesQuartz/qt-V.4.6-32k

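from tokenizers import Tokenizer

# Load tokenizer.json (downloaded from the repo above) from the working directory
tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("Everyone has the right to liberty and security of person.")
print(enc.ids)              # token ids
print(tok.decode(enc.ids))  # should reproduce the input text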

With the Transformers wrapper:

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

Architecture

V.4.6 uses a two-stage SuperBPE process on a byte-level BPE base:

Stage 1 trains with a standard word-boundary pre-tokenizer (28,800 merges).
Stage 2 drops the word-boundary split (newline-isolated only), letting a portion of the remaining budget form cross-word "superword" tokens for common multi-word units. Final vocabulary: 32,000.
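
The difference between the two stages comes down to which chunks the merge learner is allowed to see. The sketch below illustrates that idea only: the regular expressions are simplified stand-ins, not the pre-tokenizer rules actually used to train V.4.6, and the function names are hypothetical.

```python
import re
from typing import List


def stage1_chunks(text: str) -> List[str]:
    # Word-boundary split: each word (with an optional leading space) is its own
    # chunk, so no merge can span two words.
    return re.findall(r" ?[^\s]+|\s+", text)


def stage2_chunks(text: str) -> List[str]:
    # Only newlines are isolated, so merges may cross spaces within a line.
    return [part for part in re.split(r"(\n)", text) if part]


text = "by the way\nof course"
print(stage1_chunks(text))  # ['by', ' the', ' way', '\n', 'of', ' course']
print(stage2_chunks(text))  # ['by the way', '\n', 'of course']
```

Because BPE merges are learned within a chunk, a token such as " the way" can only appear once the chunks look like the stage 2 output.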

The byte-level base guarantees lossless encoding of any input: every byte is representable, so there is no true out-of-vocabulary case. Scripts the tokenizer has not learned merges for still encode correctly - just less efficiently.
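
A small illustration of why the byte-level base is lossless, using plain Python rather than the tokenizer itself: any string reduces to UTF-8 bytes, each byte has an id in 0..255, and the bytes decode back to the original string. The sample text is an arbitrary three-character Lao string chosen for the example.

```python
text = "\u0ea5\u0eb2\u0ea7"  # three Lao code points

# Worst case for a byte-level tokenizer: one token per UTF-8 byte.
byte_tokens = list(text.encode("utf-8"))

print(len(text), len(byte_tokens))  # 3 9 (each Lao character is 3 bytes in UTF-8)
print(all(0 <= b <= 255 for b in byte_tokens))  # True

# Decoding the byte ids recovers the input exactly.
roundtrip = bytes(byte_tokens).decode("utf-8")
print(roundtrip == text)  # True
```

In this worst case the compression is exactly 1.0 bytes/token, which is why a value near 1.0 signals that a script has no learned merges.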

Benchmark results (FLORES-200 devtest)

Measured on the FLORES-200 devtest split: 204 languages, 1,012 parallel sentences each. The primary metric is bytes per token (BPT) - higher means better compression. A BPT at or near 1.0 indicates raw byte fallback (the script learned no merges). BPT is reported instead of tokens-per-word because it is well-defined for whitespace-less scripts (Lao, Thai, CJK, Tibetan), where tokens-per-word is misleading.
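
As a rough sketch of the metric, BPT can be computed as total UTF-8 bytes divided by total tokens. Aggregating over the whole sentence set (rather than averaging per-sentence ratios) is an assumption here, and the two toy encoders are hypothetical stand-ins for a real tokenizer.

```python
from typing import Callable, Iterable, List


def bytes_per_token(sentences: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Total UTF-8 bytes divided by total tokens over all sentences."""
    total_bytes = 0
    total_tokens = 0
    for s in sentences:
        total_bytes += len(s.encode("utf-8"))
        total_tokens += len(encode(s))
    return total_bytes / total_tokens


def byte_fallback_encode(text: str) -> List[int]:
    # One token per byte: what a script with no merges degrades to.
    return list(text.encode("utf-8"))


def whitespace_encode(text: str) -> List[int]:
    # Toy encoder: one token per whitespace-separated word.
    return list(range(len(text.split())))


corpus = ["the right to liberty", "security of person"]
fallback_bpt = bytes_per_token(corpus, byte_fallback_encode)
word_bpt = bytes_per_token(corpus, whitespace_encode)
print(fallback_bpt)        # 1.0
print(round(word_bpt, 2))  # 5.43 (38 bytes over 7 tokens)
```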

Summary (201 FLORES languages with usable coverage):

Metric	Value
Mean compression	3.09 bytes/token
Median compression	2.63 bytes/token
Best-compressed	Tibetan, 7.93 bytes/token
Languages passing health gate	201 / 201 (the 3 FLORES languages in out-of-scope scripts are excluded)

Selected languages:

Language	Script	Bytes/token	Vocab coverage (tokens)
English	Latin	3.11	9,011
Chinese (Simp.)	Han	2.96	6,606
Japanese	Han + Kana	4.24	6,606
Korean	Hangul	3.68	6,606
Russian	Cyrillic	4.36	1,383
Arabic	Arabic	3.49	1,045
Hebrew	Hebrew	3.73	727
Hindi	Devanagari	6.01	1,246
Tamil	Tamil	6.63	501
Thai	Thai	6.76	953
Lao	Lao	4.30	239
Tibetan	Tibetan	7.93	1,849
Burmese	Myanmar	7.17	748
Khmer	Khmer	5.52	567

Lao fix, before/after:

Version	Lao bytes/token	Lao vocab coverage	Status
V.4.4 32K	1.02	0	byte-fallback collapse
V.4.6 32K	4.30	239	fixed
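
To put the two BPT values in terms of sequence length, the arithmetic below converts them into token counts. The 300-byte sentence length is a hypothetical figure chosen for illustration, not a measurement from the benchmark.

```python
bpt_v44 = 1.02  # Lao bytes/token, V.4.4 32K (from the table above)
bpt_v46 = 4.30  # Lao bytes/token, V.4.6 32K (from the table above)

# Same text, same bytes: token count scales inversely with BPT.
reduction = bpt_v46 / bpt_v44

sentence_bytes = 300  # hypothetical Lao sentence length in UTF-8 bytes
tokens_v44 = sentence_bytes / bpt_v44
tokens_v46 = sentence_bytes / bpt_v46

print(f"{reduction:.2f}x fewer tokens")       # 4.22x fewer tokens
print(round(tokens_v44), round(tokens_v46))   # 294 70
```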

Known limitations

All 27 targeted scripts are present in the vocabulary, and 26 have strong coverage (hundreds to thousands of tokens each). One targeted script and two out-of-scope scripts are under-resourced; all encode losslessly but inefficiently.

In-scope (targeted script, under-resourced) - priority fix for V.4.7:

Language	Script	Bytes/token	Vocab tokens	Notes
Dhivehi (`dv`)	Thaana	~1.49	6	only diacritics, no base letters

Thaana is one of the 27 targeted scripts but received almost no merges (6 diacritic tokens). It is not in FLORES-200, so it was found by auditing the shipped vocabulary against the corpus language list rather than via the benchmark. Bringing Dhivehi up to usable coverage is the primary goal of V.4.7, and is the same class of fix as the Lao repair in this release.
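
An audit of this kind can be approximated by counting vocabulary tokens per script. The sketch below is an assumption about how such a check might look, not the audit actually run: it matches scripts by Unicode character-name prefix (a heuristic), uses a toy vocabulary, and assumes tokens have already been decoded from the byte-level representation into ordinary strings.

```python
import unicodedata
from collections import Counter
from typing import Iterable


def script_token_counts(vocab: Iterable[str], scripts: Iterable[str]) -> Counter:
    """Count tokens that contain at least one character from each named script."""
    counts = Counter({name: 0 for name in scripts})
    for token in vocab:
        seen = set()
        for ch in token:
            # Unicode names begin with the script name, e.g. "LAO LETTER KO".
            name = unicodedata.name(ch, "")
            for script in counts:
                if name.startswith(script + " "):
                    seen.add(script)
        for script in seen:
            counts[script] += 1
    return counts


# Toy vocabulary: two Lao tokens, one Thaana diacritic, one Latin token.
toy_vocab = ["\u0ea5\u0eb2\u0ea7", "\u0e81", "\u07a6", "hello"]
counts = script_token_counts(toy_vocab, ["LAO", "THAANA", "TIFINAGH"])
missing = [script for script, n in counts.items() if n == 0]

print(dict(counts))  # {'LAO': 2, 'THAANA': 1, 'TIFINAGH': 0}
print(missing)       # ['TIFINAGH']
```

A count of zero, or a count made up only of diacritics as with Thaana here, flags a script for closer inspection even when no benchmark covers it.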

Out-of-scope (not in the 72-language target) - candidates for future coverage:

Language	Script	Bytes/token
Tamasheq (`taq_Tfng`)	Tifinagh	~1.04
Central Atlas Tamazight (`tzm_Tfng`)	Tifinagh	~1.04
Santali (`sat_Olck`)	Ol Chiki	~1.07

These three were never part of the V.4.6 language target; they surfaced because the FLORES-200 benchmark covers more languages than the tokenizer was trained for. They are reported here for transparency and are candidates for a later expansion, but were not a V.4.6 goal.

Special tokens

Reserved structural tokens include chat roles (<|system|>, <|user|>, <|assistant|>), tool use (<|tool_call|>, <|tool_result|>), reasoning (<|thinking|>, <|/thinking|>), code (<|code|>, <|/code|>), and a set of language tags (<|lang:en|>, etc.), alongside the usual <|bos|>, <|endoftext|>, <|padding|>, <|unk|>, <|sep|>.

Version history

V.4.6 - Lao coverage at 239 tokens (0 in V.4.4); corpus re-cleaned; health-gated.
V.4.5 - First Lao coverage (201 tokens); Lao byte-fallback resolved.
V.4.4 - 32K; Lao absent (byte-fallback).
V.4.1 - 32K and 64K variants.

About

QuartzTokenizer is developed by Quartz, the open-source division of AENEA Global Ltd, which publishes AENEA's open tokenizers, data pipelines, and open models. The 32K QuartzTokenizer variant is the tokenizer for the Prelude model family.

License

Apache 2.0. The FLORES-200 benchmark data used for evaluation is licensed CC BY-SA 4.0 by its authors and is not redistributed here.
