QuartzTokenizer V.4.6 32K Prelude

A 32,000-token multilingual SuperBPE tokenizer. The training corpus targets 72 languages across 27 scripts. V.4.6 resolves the Lao byte-fallback gap present in earlier releases and ships with a cleaner training corpus.

The 32K QuartzTokenizer variant is the tokenizer for the Prelude model family. QuartzTokenizer is part of the Quartz open-source division of AENEA Global Ltd - the division that publishes AENEA's open tokenizers, data pipelines, and open models.

| Property | Value |
|---|---|
| Vocabulary size | 32,000 |
| Algorithm | SuperBPE (two-stage), byte-level BPE base |
| Languages targeted | 72 |
| Scripts targeted | 27 (all present; 26 with strong coverage - see Known limitations) |
| Special tokens | 86 (chat, tool-use, thinking, code, language tags) |
| Used by | Prelude model family |

What's new in V.4.6

  • Lao is fixed. V.4.4 had zero Lao merges, so every Lao character fell back to its raw UTF-8 bytes (about 1.0 bytes/token, effectively no compression). V.4.6 ships 239 Lao tokens; Lao now compresses at 4.30 bytes/token on FLORES-200 devtest.
  • Cleaner corpus. Wikipedia template/category residue (native-language category prefixes, nested-bullet list markup, infobox formulas) was removed from the training corpus, slightly improving compression equity across underserved scripts.
  • Benchmarked with a health gate. Every release is now validated against FLORES-200 with explicit checks for byte-fallback collapse, zero script coverage, lossless roundtrip, and UNK emission.
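The four checks named above can be sketched as a small per-language gate. This is a minimal illustration, not the release's actual gate: the `health_gate` function, the `GateResult` type, and the 1.1 collapse threshold are all assumptions, and a pure byte-fallback stand-in replaces the real tokenizer so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class GateResult:
    language: str
    bytes_per_token: float
    failures: List[str]

    @property
    def passed(self) -> bool:
        return not self.failures


def health_gate(
    language: str,
    sentences: Sequence[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    script_vocab_tokens: int,
    unk_id: int,
    collapse_threshold: float = 1.1,  # assumed cutoff, not taken from the release
) -> GateResult:
    """Run the four checks for one language and collect any failures."""
    failures: List[str] = []
    total_bytes = 0
    total_tokens = 0
    for text in sentences:
        ids = encode(text)
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(ids)
        if decode(ids) != text and "lossy roundtrip" not in failures:
            failures.append("lossy roundtrip")
        if unk_id in ids and "UNK emitted" not in failures:
            failures.append("UNK emitted")
    bpt = total_bytes / total_tokens if total_tokens else 0.0
    if bpt <= collapse_threshold:
        failures.append("byte-fallback collapse")
    if script_vocab_tokens == 0:
        failures.append("zero script coverage")
    return GateResult(language, bpt, failures)


# Stand-in tokenizer: pure byte fallback, one token per UTF-8 byte.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


result = health_gate(
    "lao_Laoo", ["ສະບາຍດີ"], byte_encode, byte_decode,
    script_vocab_tokens=0, unk_id=-1,
)
print(result.passed, result.failures)
```

With the stand-in, the gate reports a byte-fallback collapse and zero script coverage, which is the failure pattern described for Lao in V.4.4. To gate a real tokenizer, pass `lambda s: tok.encode(s).ids` and `tok.decode` instead.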

Intended use

Designed as the tokenizer for the Prelude model family - small-to-mid multilingual language models. Suitable for any byte-level BPE workflow that needs broad, equitable script coverage at a 32K budget. Lossless on all tested inputs thanks to the byte-level fallback.

How to use

Hugging Face repo: JamesQuartz/qt-V.4.6-32k

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("Everyone has the right to liberty and security of person.")
print(enc.ids)
print(tok.decode(enc.ids))

With the Transformers wrapper:

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

Architecture

V.4.6 uses a two-stage SuperBPE process on a byte-level BPE base:

  • Stage 1 trains with a standard word-boundary pre-tokenizer (28,800 merges).
  • Stage 2 drops the word-boundary split and segments the corpus only at newlines, so a portion of the remaining budget can form cross-word "superword" tokens for common multi-word units. Final vocabulary: 32,000.
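The two stages can be illustrated with a toy byte-level BPE trainer. This is a sketch of the general technique, not the Quartz training pipeline: the regex, the function names, and the tie-breaking rule are illustrative assumptions, and the merge budgets are tiny so the result can be read by eye.

```python
import re
from collections import Counter
from typing import List, Tuple

Pair = Tuple[bytes, bytes]


def apply_merge(seq: List[bytes], pair: Pair) -> List[bytes]:
    """Replace every adjacent occurrence of `pair` in `seq` with the merged token."""
    out: List[bytes] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(chunks: List[List[bytes]], num_merges: int) -> List[Pair]:
    """Greedy BPE: repeatedly merge the most frequent adjacent pair inside chunks."""
    merges: List[Pair] = []
    for _ in range(num_merges):
        counts: Counter = Counter()
        for seq in chunks:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break  # every chunk is already a single token
        best = max(counts, key=lambda p: (counts[p], p))  # deterministic tie-break
        merges.append(best)
        chunks[:] = [apply_merge(seq, best) for seq in chunks]
    return merges


def to_bytes(chunk: str) -> List[bytes]:
    return [bytes([b]) for b in chunk.encode("utf-8")]


def train_two_stage(
    corpus: str, stage1_merges: int, stage2_merges: int
) -> Tuple[List[Pair], List[Pair]]:
    # Stage 1: split on word boundaries so no merge can span two words.
    words = re.findall(r" ?\S+|\s+", corpus)
    merges1 = learn_merges([to_bytes(w) for w in words], stage1_merges)

    # Stage 2: split on newlines only, replay the stage-1 merges, then keep merging.
    lines = [to_bytes(line) for line in corpus.split("\n") if line]
    for pair in merges1:
        lines = [apply_merge(seq, pair) for seq in lines]
    merges2 = learn_merges(lines, stage2_merges)
    return merges1, merges2


corpus = "\n".join(["in the end"] * 4)
merges1, merges2 = train_two_stage(corpus, stage1_merges=7, stage2_merges=2)
print(merges2)
```

On this corpus, stage 1 ends with the single-word tokens `in`, ` the`, and ` end`, and stage 2 joins them into `in the` and then `in the end`. Stage-1 tokens never contain a space past their first byte, while stage-2 tokens do, which is the cross-word behaviour the second stage exists to allow.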

The byte-level base guarantees lossless encoding of any input: every byte is representable, so there is no true out-of-vocabulary case. Scripts the tokenizer has not learned merges for still encode correctly - just less efficiently.

Benchmark results (FLORES-200 devtest)

Measured on the FLORES-200 devtest split: 204 languages, 1,012 parallel sentences each. The primary metric is bytes per token (BPT): UTF-8 bytes of input divided by tokens produced, so higher means better compression. A BPT near 1.0 indicates raw byte fallback (the tokenizer learned no merges for that script). BPT is reported instead of tokens-per-word because it is well-defined for whitespace-less scripts (Lao, Thai, CJK, Tibetan), where tokens-per-word is misleading.
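As a concrete reading of the metric, the sketch below computes BPT as total UTF-8 bytes over total tokens for a list of sentences. Pooling bytes and tokens across the whole list (rather than averaging per-sentence ratios) is an assumption, since the card does not say which aggregation it uses; the two encoders are stand-ins, not the shipped tokenizer.

```python
from typing import Callable, Iterable, List


def bytes_per_token(
    sentences: Iterable[str], encode: Callable[[str], List[int]]
) -> float:
    """Corpus-level BPT: total UTF-8 bytes divided by total tokens."""
    total_bytes = 0
    total_tokens = 0
    for text in sentences:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_bytes / total_tokens


def byte_fallback_encode(text: str) -> List[int]:
    # Stand-in for a script with no learned merges: one token per byte.
    return list(text.encode("utf-8"))


def char_encode(text: str) -> List[int]:
    # Stand-in for a tokenizer that learned exactly one token per character.
    return [ord(c) for c in text]


lao = ["ສະບາຍດີ"]  # every Lao code point is 3 bytes in UTF-8
print(bytes_per_token(lao, byte_fallback_encode))  # 1.0
print(bytes_per_token(lao, char_encode))           # 3.0
```

The metric needs no notion of a word, which is why it stays meaningful for Lao. With the real tokenizer, the encoder argument would be `lambda s: tok.encode(s).ids`.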

Summary (201 FLORES languages with usable coverage):

| Metric | Value |
|---|---|
| Mean compression | 3.09 bytes/token |
| Median compression | 2.63 bytes/token |
| Best-compressed | Tibetan, 7.93 bytes/token |
| Languages passing health gate | 201 / 201 in-target (3 out-of-scope FLORES extras excluded) |

Selected languages:

| Language | Script | Bytes/token | Vocab tokens |
|---|---|---|---|
| English | Latin | 3.11 | 9,011 |
| Chinese (Simp.) | Han | 2.96 | 6,606 |
| Japanese | Jpan | 4.24 | 6,606 |
| Korean | Hangul | 3.68 | 6,606 |
| Russian | Cyrillic | 4.36 | 1,383 |
| Arabic | Arabic | 3.49 | 1,045 |
| Hebrew | Hebrew | 3.73 | 727 |
| Hindi | Devanagari | 6.01 | 1,246 |
| Tamil | Tamil | 6.63 | 501 |
| Thai | Thai | 6.76 | 953 |
| Lao | Lao | 4.30 | 239 |
| Tibetan | Tibetan | 7.93 | 1,849 |
| Burmese | Myanmar | 7.17 | 748 |
| Khmer | Khmer | 5.52 | 567 |

Lao fix, before/after:

| Version | Lao bytes/token | Lao vocab tokens | Status |
|---|---|---|---|
| V.4.4 32K | 1.02 | 0 | byte-fallback collapse |
| V.4.6 32K | 4.30 | 239 | fixed |

Known limitations

All 27 targeted scripts are present in the vocabulary, and 26 have strong coverage (hundreds to thousands of tokens each). One targeted script and two out-of-scope scripts are under-resourced; all encode losslessly but inefficiently.

In-scope (targeted script, under-resourced) - priority fix for V.4.7:

| Language | Script | Bytes/token | Vocab tokens | Notes |
|---|---|---|---|---|
| Dhivehi (dv) | Thaana | ~1.49 | 6 | only diacritics, no base letters |

Thaana is one of the 27 targeted scripts but received almost no merges (6 diacritic tokens). It is not in FLORES-200, so it was found by auditing the shipped vocabulary against the corpus language list rather than via the benchmark. Bringing Dhivehi up to usable coverage is the primary goal of V.4.7, and is the same class of fix as the Lao repair in this release.
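A vocabulary audit of this kind can be sketched by counting tokens per Unicode block and separating tokens that contain base letters from tokens made only of combining marks. This is an illustrative sketch, not the audit script used for the release: `audit_vocab` and `SCRIPT_BLOCKS` are hypothetical names, and the input is assumed to be token strings already decoded to text (a byte-level BPE vocabulary file stores tokens in a byte-mapped form, so they would need decoding first).

```python
import unicodedata
from collections import Counter
from typing import Dict, Iterable, Tuple

# Unicode block ranges (inclusive) for a few scripts named in this card.
SCRIPT_BLOCKS: Dict[str, Tuple[int, int]] = {
    "Lao": (0x0E80, 0x0EFF),
    "Thaana": (0x0780, 0x07BF),
    "Tifinagh": (0x2D30, 0x2D7F),
    "Ol Chiki": (0x1C50, 0x1C7F),
}


def audit_vocab(tokens: Iterable[str]) -> Dict[str, Counter]:
    """Per script, count tokens with base letters vs. tokens made only of marks."""
    report: Dict[str, Counter] = {name: Counter() for name in SCRIPT_BLOCKS}
    for token in tokens:
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            chars = [c for c in token if lo <= ord(c) <= hi]
            if not chars:
                continue
            # General categories starting with "L" are letters; "Mn" are combining marks.
            has_letter = any(unicodedata.category(c).startswith("L") for c in chars)
            report[name]["with_letters" if has_letter else "marks_only"] += 1
    return report


# Tiny made-up vocabulary: one Lao token, two Thaana vowel signs, one Latin token.
sample_vocab = ["ສະ", "\u07a6", "\u07aa", "hello"]
report = audit_vocab(sample_vocab)
for script, counts in report.items():
    print(script, dict(counts))
```

A script whose row shows only `marks_only` tokens matches the Dhivehi situation described above, and a script with an empty row has no coverage at all. Comparing the rows against the list of targeted scripts catches gaps that a benchmark without that language cannot reveal.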

Out-of-scope (not in the 72-language target) - candidates for future coverage:

| Language | Script | Bytes/token |
|---|---|---|
| Tamasheq (taq_Tfng) | Tifinagh | ~1.04 |
| Central Atlas Tamazight (tzm_Tfng) | Tifinagh | ~1.04 |
| Santali (sat_Olck) | Ol Chiki | ~1.07 |

These three were never part of the V.4.6 language target; they surfaced because the FLORES-200 benchmark covers more languages than the tokenizer was trained for. They are reported here for transparency and are candidates for a later expansion, but were not a V.4.6 goal.

Special tokens

Reserved structural tokens include chat roles (<|system|>, <|user|>, <|assistant|>), tool use (<|tool_call|>, <|tool_result|>), reasoning (<|thinking|>, <|/thinking|>), code (<|code|>, <|/code|>), and a set of language tags (<|lang:en|>, etc.), alongside the usual <|bos|>, <|endoftext|>, <|padding|>, <|unk|>, <|sep|>.
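When loading through the Transformers wrapper, the standard special-token slots are not filled in unless they are passed explicitly. The sketch below shows one way to do that. The role assignments are assumptions (in particular, treating `<|endoftext|>` as the end-of-sequence token), since this card lists the tokens but not their roles, and `load_wrapped` is a hypothetical helper name.

```python
from typing import Dict

# Assumed role assignments; the card names these tokens but not which slot each fills.
SPECIAL_TOKEN_KWARGS: Dict[str, str] = {
    "bos_token": "<|bos|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "<|padding|>",
    "unk_token": "<|unk|>",
    "sep_token": "<|sep|>",
}


def load_wrapped(tokenizer_file: str = "tokenizer.json"):
    """Load the tokenizer through Transformers with special tokens registered."""
    # Imported lazily so the mapping above can be inspected without Transformers installed.
    from transformers import PreTrainedTokenizerFast

    return PreTrainedTokenizerFast(tokenizer_file=tokenizer_file, **SPECIAL_TOKEN_KWARGS)
```

After `tok = load_wrapped()`, batch padding such as `tok(["a", "b c"], padding=True)` has a pad token to use. If the published `tokenizer_config.json` already declares these roles, prefer its values over the assumptions here.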

Version history

  • V.4.6 - Lao coverage at 239 tokens (up from 0 in V.4.4); corpus re-cleaned; health-gated.
  • V.4.5 - First Lao coverage (201 tokens); Lao byte-fallback resolved.
  • V.4.4 - 32K; Lao absent (byte-fallback).
  • V.4.1 - 32K and 64K variants.

About

QuartzTokenizer is developed by Quartz, the open-source division of AENEA Global Ltd, which publishes AENEA's open tokenizers, data pipelines, and open models. The 32K QuartzTokenizer variant is the tokenizer for the Prelude model family.

License

Apache 2.0. The FLORES-200 benchmark data used for evaluation is licensed CC BY-SA 4.0 by its authors and is not redistributed here.
