Kolors ChatGLM3 fast tokenizer (tokenizer.json)
A derived artifact for running Kwai-Kolors/Kolors-diffusers on the native-Rust/MLX mlx-gen engine (SceneWorks, epic 3090).
Why this exists
Kolors conditions on ChatGLM3-6B, which ships only a slow SentencePiece tokenizer
(tokenizer.model + a custom ChatGLMTokenizer(PreTrainedTokenizer)) — there is no fast
tokenizer.json in the upstream repo. The Rust engine's tokenizer loader (mlx_gen::TextTokenizer,
consumed by the Kolors generator and the Kolors LoRA/LoKr trainer) reads the HF tokenizers fast
serialization, so it needs a tokenizer.json.
This repo hosts that derived tokenizer.json so SceneWorks model-install can overlay it onto the
upstream Kolors-diffusers snapshot. That avoids running a Python SentencePiece-to-fast conversion
at install time on every machine (a Python-eradication consideration, epic 3482).
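As a rough illustration of the overlay step, the sketch below copies the derived tokenizer.json into a local snapshot directory. It is not the SceneWorks model-install code: the function name `overlay_tokenizer`, the `tokenizer` subfolder default, and the directory layout are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def overlay_tokenizer(derived_dir, snapshot_dir, subdir="tokenizer"):
    """Copy the derived tokenizer.json into a snapshot's tokenizer folder.

    Hypothetical helper: the `subdir` default is an assumed layout, not a
    documented SceneWorks path.
    """
    src = Path(derived_dir) / "tokenizer.json"
    if not src.is_file():
        raise FileNotFoundError(src)
    dest_dir = Path(snapshot_dir) / subdir
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "tokenizer.json"
    shutil.copy2(src, dest)  # overwrites any existing file, keeps metadata
    return dest


# Demo on throwaway directories so the sketch runs without any downloads.
with tempfile.TemporaryDirectory() as tmp:
    derived = Path(tmp) / "derived"
    derived.mkdir()
    (derived / "tokenizer.json").write_text("{}")
    placed = overlay_tokenizer(derived, Path(tmp) / "snapshot")
    placed_ok = placed.read_text() == "{}"
```

Because the copy is a plain file overlay, the install step needs no SentencePiece or transformers dependency on the target machine.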
How it was built
Materialized by tools/build_kolors_tokenizer.py (mlx-gen), which converts the ChatGLM3
SentencePiece (SP) model to a fast tokenizer.json via transformers' SP converter. The fast
tokenizer reproduces the SP content ids exactly and adds no special tokens. The ChatGLM
[gMASK] (64790) / sop (64792) prefix, left-padding, and position_ids are instead applied by the
Rust KolorsTokenizer wrapper, matching the upstream build_inputs_with_special_tokens + _pad
behavior with max_length=256.
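The wrapper behavior described above can be sketched in a few lines of Python as a reference model. This is not the Rust KolorsTokenizer itself: the function name `build_kolors_inputs` is hypothetical, the attention mask is included on the assumption that it mirrors the upstream `_pad` output, and truncating content to `max_length - 2` ids (to leave room for the prefix) is an assumption about how truncation interacts with the special tokens.

```python
GMASK_ID = 64790   # [gMASK]
SOP_ID = 64792     # sop
PAD_ID = 0         # pad (same id as unk)
MAX_LENGTH = 256


def build_kolors_inputs(content_ids, max_length=MAX_LENGTH):
    """Prefix, truncate, and left-pad content ids (hypothetical reference)."""
    prefix = [GMASK_ID, SOP_ID]
    # Assumption: content is cut so that prefix + content fits max_length.
    room = max_length - len(prefix)
    ids = prefix + list(content_ids)[:room]
    pad = max_length - len(ids)
    return {
        # Left padding: pad ids come first, real tokens sit at the end.
        "input_ids": [PAD_ID] * pad + ids,
        "attention_mask": [0] * pad + [1] * len(ids),
        # Positions count from 0 at the first real token; pads get 0.
        "position_ids": [0] * pad + list(range(len(ids))),
    }


enc = build_kolors_inputs([10, 11, 12])      # short prompt
enc_empty = build_kolors_inputs([])          # empty negative prompt
enc_long = build_kolors_inputs(range(1, 1000))  # triggers truncation
```

For the short prompt, the last five entries of `input_ids` are the two prefix ids followed by the three content ids, and everything before them is padding.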
Validation: fast-tokenizer content ids == sp_model.encode(text) across an EN + EN-long (truncation)
+ CN + mixed CN/EN + empty (negative-prompt) battery; special-token ids asserted
([gMASK]=64790, sop=64792, pad=unk=0).
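A parity check of this kind might look like the sketch below. It is not the project's actual validation script: `check_parity`, the `BATTERY` prompts, and the toy encoders are hypothetical stand-ins so the example runs without the tokenizer files. The comments show how the real encoders could be wired in using the `tokenizers` and `sentencepiece` packages.

```python
def check_parity(fast_encode, sp_encode, prompts):
    """Return the names of prompts whose id sequences differ."""
    mismatches = []
    for name, text in prompts.items():
        if list(fast_encode(text)) != list(sp_encode(text)):
            mismatches.append(name)
    return mismatches


# Hypothetical battery covering the cases named above.
BATTERY = {
    "en": "a photo of a cat",
    "en_long": "a photo of a cat " * 200,  # long enough to hit truncation later
    "cn": "一只猫的照片",
    "mixed": "一只 cat 的照片",
    "empty": "",                           # empty negative prompt
}

# With the real files, the two encoders could be built roughly like this:
#   from tokenizers import Tokenizer
#   import sentencepiece as spm
#   fast = Tokenizer.from_file("tokenizer.json")
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   fast_encode = lambda t: fast.encode(t, add_special_tokens=False).ids
#   sp_encode = lambda t: sp.encode(t)

# Toy stand-in encoders so the sketch is runnable on its own.
toy_encode = lambda text: [ord(ch) for ch in text]
broken_encode = lambda text: toy_encode(text)[:-1]  # drops the last id

same = check_parity(toy_encode, toy_encode, BATTERY)
diff = check_parity(toy_encode, broken_encode, BATTERY)
```

Returning the list of failing prompt names, rather than a single boolean, makes it easier to see which case in the battery diverged.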
Files
- tokenizer.json: the derived fast tokenizer (the file the Rust engine needs).
- tokenizer.model: the upstream ChatGLM3 SentencePiece model (provenance / reproducibility).
- tokenizer_config.json: the upstream tokenizer config.
License & provenance
Derived from the ChatGLM3 tokenizer shipped with Kwai-Kolors/Kolors-diffusers. Use is governed by the upstream Kolors model license and the ChatGLM3-6B license. This repo redistributes only the tokenizer (no model weights) for engine interoperability.