Kolors ChatGLM3 fast tokenizer (`tokenizer.json`)

A derived artifact for running Kwai-Kolors/Kolors-diffusers on the native-Rust/MLX mlx-gen engine (SceneWorks, epic 3090).

Why this exists

Kolors conditions on ChatGLM3-6B, which ships only a slow SentencePiece tokenizer (tokenizer.model + a custom ChatGLMTokenizer(PreTrainedTokenizer)) — there is no fast tokenizer.json in the upstream repo. The Rust engine's tokenizer loader (mlx_gen::TextTokenizer, consumed by the Kolors generator and the Kolors LoRA/LoKr trainer) reads the HF tokenizers fast serialization, so it needs a tokenizer.json.

This repo hosts that derived tokenizer.json so that SceneWorks model-install can overlay it onto the upstream Kolors-diffusers snapshot. The alternative is to run a Python SentencePiece-to-fast conversion at install time on every machine, which the Python-eradication effort (epic 3482) aims to avoid.
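As a rough illustration of the overlay step, the sketch below copies the derived file into the directory that holds the upstream tokenizer files. The function name `overlay_tokenizer` and both path arguments are hypothetical; the actual SceneWorks model-install logic and the snapshot layout are not described here.

```python
import shutil
from pathlib import Path


def overlay_tokenizer(derived_json: Path, tokenizer_dir: Path) -> Path:
    """Copy the derived tokenizer.json next to the upstream tokenizer files.

    Both arguments are assumptions for illustration: `derived_json` is a local
    copy of this repo's tokenizer.json, `tokenizer_dir` is wherever the
    snapshot keeps tokenizer.model.
    """
    if not derived_json.is_file():
        raise FileNotFoundError(derived_json)
    tokenizer_dir.mkdir(parents=True, exist_ok=True)
    target = tokenizer_dir / "tokenizer.json"
    # copy2 preserves timestamps; an existing tokenizer.json is overwritten.
    shutil.copy2(derived_json, target)
    return target
```

The copy is idempotent, so rerunning an install simply rewrites the same file.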

How it was built

Materialized by tools/build_kolors_tokenizer.py (mlx-gen), which converts the ChatGLM3 SentencePiece model to a fast tokenizer.json via the transformers SentencePiece converter. The fast tokenizer reproduces the SentencePiece content ids exactly and adds no special tokens. The ChatGLM prefix ([gMASK] = 64790, sop = 64792), left-padding, and position_ids are applied by the Rust KolorsTokenizer wrapper, matching build_inputs_with_special_tokens + _pad with max_length=256.
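The wrapper behavior described above can be sketched in a few lines of Python. This is not the Rust KolorsTokenizer code; it is a minimal illustration that assumes truncation keeps the two-token prefix and drops trailing content ids, and that padded positions get attention_mask 0 and position_id 0. The function name `build_kolors_inputs` is hypothetical.

```python
GMASK_ID = 64790   # [gMASK]
SOP_ID = 64792     # sop
PAD_ID = 0         # pad = unk = 0
MAX_LENGTH = 256


def build_kolors_inputs(content_ids, max_length=MAX_LENGTH):
    """Prefix, truncate, and left-pad content ids from the fast tokenizer."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two prefix tokens")
    # Assumption: the prefix is always kept, so content gets max_length - 2 slots.
    content = list(content_ids)[: max_length - 2]
    ids = [GMASK_ID, SOP_ID] + content
    seq_len = len(ids)
    pad = max_length - seq_len
    return {
        # Padding goes on the left, real tokens end at the last position.
        "input_ids": [PAD_ID] * pad + ids,
        "attention_mask": [0] * pad + [1] * seq_len,
        # Assumption: real tokens are numbered from 0, pads are filled with 0.
        "position_ids": [0] * pad + list(range(seq_len)),
    }
```

For example, `build_kolors_inputs([10, 11, 12], max_length=8)` yields `input_ids` of `[0, 0, 0, 64790, 64792, 10, 11, 12]`. An empty negative prompt produces only padding followed by the two prefix tokens.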

Validation: fast-tokenizer content ids == sp_model.encode(text) across a battery of EN, EN-long (truncation), CN, mixed CN/EN, and empty (negative-prompt) inputs; special-token ids are asserted ([gMASK]=64790, sop=64792, pad=unk=0).
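A parity check of this kind might look like the sketch below. The helper `check_parity` and the prompt list are hypothetical stand-ins, not the battery used by the build tool. The two encoders are passed in as callables so the check itself has no dependencies; the comments show how they would presumably be wired to the `tokenizers` and `sentencepiece` packages.

```python
def check_parity(encode_fast, encode_sp, prompts):
    """Return (prompt, fast_ids, sp_ids) for every prompt whose ids differ."""
    mismatches = []
    for text in prompts:
        fast_ids = list(encode_fast(text))
        sp_ids = list(encode_sp(text))
        if fast_ids != sp_ids:
            mismatches.append((text, fast_ids, sp_ids))
    return mismatches


# Illustrative prompts covering the same categories as the validation battery.
PROMPTS = [
    "a photo of a cat",            # EN
    "a photo of a cat, " * 200,    # EN-long (would hit truncation downstream)
    "一只猫的照片",                  # CN
    "一只 cat 的照片",               # mixed CN/EN
    "",                            # empty negative prompt
]

# Presumed wiring with the real libraries (not executed here):
#   from tokenizers import Tokenizer
#   import sentencepiece as spm
#   fast = Tokenizer.from_file("tokenizer.json")
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   bad = check_parity(
#       lambda t: fast.encode(t, add_special_tokens=False).ids,
#       lambda t: sp.encode(t),
#       PROMPTS,
#   )
#   assert not bad
```

Returning the mismatching prompts, rather than failing on the first one, makes it easier to see whether a divergence is specific to one script or length class.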

Files

tokenizer.json — the derived fast tokenizer (the file the Rust engine needs).
tokenizer.model — the upstream ChatGLM3 SentencePiece model (provenance / reproducibility).
tokenizer_config.json — the upstream tokenizer config.

License & provenance

Derived from the ChatGLM3 tokenizer shipped with Kwai-Kolors/Kolors-diffusers. Use is governed by the upstream Kolors model license and the ChatGLM3-6B license. This repo redistributes only the tokenizer (no model weights) for engine interoperability.
