GLM-5.2 NVFP4 Int8Mix
This is an experimental hybrid GLM-5.2 checkpoint for vLLM/B12X serving.
It combines:
- the dense, attention, shared-expert, special-head, and MTP tensors from QuantTrio/GLM-5.2-Int4-Int8Mix;
- the non-shared routed MoE expert MLP projections from lukealonso/GLM-5.2-NVFP4;
- a vLLM-compatible `compressed-tensors` config update for the fused GLM-5.2 runtime module names used by current vLLM.
The repository name currently contains MTPFix because the first upload used
that internal working name. The actual checkpoint identity is better described
as GLM-5.2 NVFP4 Int8Mix.
Provenance
Base model:
Quantized sources:
This is not a full re-quantization from BF16. It is a merged checkpoint:
- QuantTrio supplies the W8A16 dense/attention/shared/MTP parts and BF16 unquantized tensors.
- Luke NVFP4 supplies the routed expert MLP projections for layers 3-77.
- The config was adjusted so vLLM can load the fused MTP names (`mtp_block`, `fused_qkv_a_proj`) used at runtime.
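The merge rule described above can be sketched as a name-based router that decides which source checkpoint supplies each tensor. This is only an illustration: the tensor-name pattern `model.layers.<i>.mlp.experts.<j>.<proj>` and the helper `source_for` are hypothetical and not taken from the actual merge script. Only the layer range 3-77 and the two source repository names come from this card.

```python
import re

# Hypothetical key pattern for routed (non-shared) expert MLP projections.
# The real checkpoint's tensor names may differ; this is an assumption.
ROUTED_EXPERT = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\."
)

QUANTTRIO = "QuantTrio/GLM-5.2-Int4-Int8Mix"
LUKE = "lukealonso/GLM-5.2-NVFP4"


def source_for(tensor_name: str) -> str:
    """Return the source checkpoint that would supply this tensor in the merge."""
    match = ROUTED_EXPERT.match(tensor_name)
    # Routed expert projections in layers 3-77 come from the NVFP4 checkpoint.
    if match and 3 <= int(match.group(1)) <= 77:
        return LUKE
    # Everything else (dense, attention, shared experts, MTP, BF16 tensors)
    # comes from the QuantTrio checkpoint.
    return QUANTTRIO
```

Under these naming assumptions, a shared-expert tensor or anything in layer 78 falls through to the QuantTrio branch, which matches the split listed above.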
Quantization layout
The effective `quantization_config` uses compressed-tensors with `format: nvfp4-pack-quantized`.
| Scope | Format |
|---|---|
| `model.layers.0` and ignored special paths | BF16 |
| Dense attention and ordinary linear weights in layers 1-77 | W8A16 INT8, symmetric group quantization, group size 128 |
| Shared experts in layers 1-77 | W8A16 INT8, symmetric group quantization, group size 128 |
| Non-shared routed MoE experts in layers 3-77 | NVFP4-style float 4-bit weights, tensor-group strategy, group size 16 |
| Layer 78 MTP block | W8A16 INT8, channel-wise |
| `mlp.gate`, attention indexer, norms, embeddings, and special heads | BF16 / ignored |
Compared with the original QuantTrio checkpoint, the routed expert tensors are not INT4 group-size-128 weights anymore. They are replaced by Luke's NVFP4 expert tensors.
Compared with Luke's NVFP4 checkpoint, this checkpoint does not keep the dense and attention parts in the same BF16/NVFP4 ModelOpt layout. Those parts come from QuantTrio's compact W8A16 export.
Notes on NVFP4 expert quality
Luke's NVFP4 checkpoint was quantized directly from the BF16 GLM-5.2 checkpoint with NVIDIA Model Optimizer. In that source checkpoint, only the non-shared MoE expert MLP projections are quantized to NVFP4; attention weights, early dense MLP layers, and shared experts are left unquantized. Calibration used natural top-k routing rather than forcing all experts active, with broad sample coverage to better match the distributions experts see during inference.
That matters for this hybrid checkpoint because the routed MoE experts are the largest parameter component and the most routing-sensitive part of GLM-5.2. NVFP4 stores 4-bit floating-point weights in blocks of 16 values, each block with its own FP8 scale, while the original QuantTrio expert path uses integer 4-bit group quantization with group size 128. The finer scaling granularity is one reason the NVFP4 expert path can preserve the BF16 distribution better in local KLD tests.
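The granularity trade-off can be made concrete with a rough storage estimate: weight bits plus the per-group scale cost spread over the group. The sketch below assumes 16-bit scales for the integer group formats, which this card does not state, and ignores any per-tensor global scale and other metadata. The helper name `bits_per_weight` is illustrative only.

```python
def bits_per_weight(weight_bits: int, group_size: int, scale_bits: int) -> float:
    """Weight bits plus the per-group scale cost amortized over the group."""
    return weight_bits + scale_bits / group_size


# NVFP4 experts: 4-bit weights, group size 16, FP8 (8-bit) scale per group.
nvfp4 = bits_per_weight(4, 16, 8)

# INT4 group quantization, group size 128; 16-bit scales are an assumption.
int4_g128 = bits_per_weight(4, 128, 16)

# W8A16 INT8 group quantization, group size 128; 16-bit scales are an assumption.
int8_g128 = bits_per_weight(8, 128, 16)

# Number of scales per weight relative to group size 128.
scale_density_ratio = 128 // 16

print(nvfp4, int4_g128, int8_g128, scale_density_ratio)
```

Under these assumptions the NVFP4 path spends somewhat more storage per weight than INT4 at group size 128, in exchange for eight times as many scale factors per weight.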
Measured local distribution quality
KLD/JS is a local next-token distribution proxy, not a full model-quality benchmark. It is useful for detecting numerical regressions, but deployment quality should also be checked with long-context tasks, coding prompts, tool calling, repetition/CJK watchdogs, MTP acceptance, throughput, and VRAM.
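For readers unfamiliar with the two metrics, the sketch below computes KL and Jensen-Shannon divergence between a reference and a quantized next-token distribution for a single position. It is a generic illustration, not the measurement code used for this card: the direction of the KL term, the use of nats, and the toy probabilities are all assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy three-token vocabulary; the values are made up for illustration.
reference = [0.7, 0.2, 0.1]   # stands in for the BF16 model's probabilities
quantized = [0.6, 0.3, 0.1]   # stands in for a quantized model's probabilities

kld = kl_divergence(reference, quantized)
jsd = js_divergence(reference, quantized)
print(f"KLD={kld:.6f} JS={jsd:.6f}")
```

A mean over many positions and prompts would then give a single summary number per checkpoint, which is the kind of figure reported in the table in this section.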
Repeated local KLD measurements from the vLLM/B12X test stack showed:
| Checkpoint | Prefill KLD mean | Decode JS mean |
|---|---|---|
| Luke NVFP4 | 0.068257 | 0.00000236 |
| QuantTrio GLM-5.2 Int4-Int8Mix | 0.070448 | 0.00000286 |
| This hybrid, W8A16 + Luke NVFP4 experts | 0.071182 | 0.00000264 |
Interpretation:
- Luke NVFP4 remains the strongest of these practical-size checkpoints in the repeated local distribution test.
- This hybrid is close to QuantTrio on prefill KLD (0.071182 vs. 0.070448) and slightly better on decode JS (0.00000264 vs. 0.00000286) in that run set, but the decode differences are small and overlap run-to-run variance.
- Do not treat KLD alone as a final quality ranking. It is one signal.
Serving status
This checkpoint was prepared for the local vLLM/B12X GLM-5.2 stack used by local-inference-lab/rtx6kpro. It is not claimed to be a generic drop-in model for every runtime.
Known working class of configuration:
- vLLM with GLM-5.2 support
- `--quantization compressed-tensors`
- `--kv-cache-dtype fp8`
- `--attention-backend B12X_MLA_SPARSE`
- `--moe-backend b12x`
- B12X A16 expert serving supported
Example shape used in local testing:
vllm serve /path/to/GLM-5.2-NVFP4-Int8Mix \
--served-model-name GLM-5.2 \
--trust-remote-code \
--tensor-parallel-size 8 \
--decode-context-parallel-size 1 \
--quantization compressed-tensors \
--attention-backend B12X_MLA_SPARSE \
--moe-backend b12x \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice \
--tool-call-parser glm47 \
--reasoning-parser glm45
For exact Docker images and launch recipes used in local benchmarking, see the GLM-5.2 v12 notes in:
https://github.com/local-inference-lab/rtx6kpro/blob/master/models/glm5.2_v12.md
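Once a server started with a command like the one above is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below assumes vLLM's default port 8000 on localhost and the served model name `GLM-5.2` from the example command. The helper names `build_chat_request` and `send`, the prompt, and the `max_tokens` value are illustrative, not part of the tested setup.

```python
import json
import urllib.request


def build_chat_request(prompt, model="GLM-5.2", base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,  # arbitrary small limit for a smoke test
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request("What is the capital of France?")
```

Calling `send(request)` against a running server returns the JSON response; the generated message is expected under `choices[0].message`, as in the OpenAI chat completions schema.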
File size
Approximate uploaded size: 409.33 GiB.
License
The model card inherits the MIT license metadata from the source GLM-5.2 release and source model cards. Check the upstream model cards for complete license and usage details.