Libraries

How to use festr2/GLM-5.2-Int8Mix-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="festr2/GLM-5.2-Int8Mix-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("festr2/GLM-5.2-Int8Mix-NVFP4")
model = AutoModelForCausalLM.from_pretrained("festr2/GLM-5.2-Int8Mix-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use festr2/GLM-5.2-Int8Mix-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "festr2/GLM-5.2-Int8Mix-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "festr2/GLM-5.2-Int8Mix-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/festr2/GLM-5.2-Int8Mix-NVFP4

SGLang

How to use festr2/GLM-5.2-Int8Mix-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "festr2/GLM-5.2-Int8Mix-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "festr2/GLM-5.2-Int8Mix-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "festr2/GLM-5.2-Int8Mix-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "festr2/GLM-5.2-Int8Mix-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

GLM-5.2 NVFP4 Int8Mix

This is an experimental hybrid GLM-5.2 checkpoint for vLLM/B12X serving.

It combines:

- the dense, attention, shared-expert, special-head, and MTP tensors from QuantTrio/GLM-5.2-Int4-Int8Mix;
- the non-shared routed MoE expert MLP projections from lukealonso/GLM-5.2-NVFP4;
- a vLLM-compatible compressed-tensors config update for the fused GLM-5.2 runtime module names used by current vLLM.

The repository name currently contains MTPFix because the first upload used that internal working name. The actual checkpoint identity is better described as GLM-5.2 NVFP4 Int8Mix.

Provenance

Base model:

zai-org/GLM-5.2

Quantized sources:

- QuantTrio/GLM-5.2-Int4-Int8Mix
- lukealonso/GLM-5.2-NVFP4 (referred to as "Luke NVFP4" in this card)

This is not a full re-quantization from BF16. It is a merged checkpoint:

- QuantTrio supplies the W8A16 dense/attention/shared/MTP parts and the unquantized BF16 tensors.
- Luke NVFP4 supplies the routed expert MLP projections for layers 3-77.
- The config was adjusted so vLLM can load the fused MTP names (`mtp_block`, `fused_qkv_a_proj`) used at runtime.
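The merge rule above amounts to a per-tensor routing decision. The following is a minimal sketch of that decision, not the script used to build this checkpoint. It assumes Hugging Face style tensor names of the form `model.layers.<N>.mlp.experts.<E>.<proj>...`; the regular expression and the `pick_source` helper are hypothetical, so check the safetensors index for the real names.

```python
import re

# Hypothetical name pattern for routed (non-shared) expert projections.
# Real tensor names in the checkpoint may differ.
ROUTED_EXPERT = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)\."
)


def pick_source(tensor_name: str) -> str:
    """Return the source checkpoint a tensor would be taken from."""
    match = ROUTED_EXPERT.match(tensor_name)
    # Routed expert projections in layers 3-77 come from the NVFP4 checkpoint.
    if match and 3 <= int(match.group(1)) <= 77:
        return "lukealonso/GLM-5.2-NVFP4"
    # Everything else (dense, attention, shared experts, MTP, BF16 tensors).
    return "QuantTrio/GLM-5.2-Int4-Int8Mix"


for name in (
    "model.layers.10.mlp.experts.5.up_proj.weight_packed",
    "model.layers.10.mlp.shared_experts.up_proj.weight",
    "model.layers.10.self_attn.o_proj.weight",
):
    print(f"{name} -> {pick_source(name)}")
```

Only the first of the three sample names matches the routed-expert pattern; shared experts and attention weights fall through to the QuantTrio source.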

Quantization layout

The effective `quantization_config` uses compressed-tensors with `format: nvfp4-pack-quantized`.

| Scope | Format |
| --- | --- |
| `model.layers.0` and ignored special paths | BF16 |
| Dense attention and ordinary linear weights in layers 1-77 | W8A16 INT8, symmetric group quantization, group size 128 |
| Shared experts in layers 1-77 | W8A16 INT8, symmetric group quantization, group size 128 |
| Non-shared routed MoE experts in layers 3-77 | NVFP4-style float 4-bit weights, tensor-group strategy, group size 16 |
| Layer 78 MTP block | W8A16 INT8, channel-wise |
| `mlp.gate`, attention indexer, norms, embeddings, and special heads | BF16 / ignored |
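To check this layout against the actual files, the `quantization_config` block of `config.json` can be summarized group by group. This is a minimal sketch that assumes the usual compressed-tensors keys (`config_groups`, `targets`, `weights`). The example dictionary is a hand-written stand-in with hypothetical target patterns, not a copy of this repository's config.

```python
def summarize_groups(quant_cfg: dict) -> list[str]:
    """Return one summary line per compressed-tensors config group."""
    lines = []
    for name, group in sorted(quant_cfg.get("config_groups", {}).items()):
        weights = group.get("weights") or {}
        lines.append(
            f"{name}: {weights.get('type')}{weights.get('num_bits')} "
            f"strategy={weights.get('strategy')} "
            f"group_size={weights.get('group_size')} "
            f"targets={len(group.get('targets', []))}"
        )
    return lines


# Hand-written stand-in; target patterns and the ignore list are hypothetical.
example_quant_cfg = {
    "quant_method": "compressed-tensors",
    "format": "nvfp4-pack-quantized",
    "config_groups": {
        "group_0": {
            "targets": ["re:.*self_attn.*"],
            "weights": {"num_bits": 8, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 128},
        },
        "group_1": {
            "targets": ["re:.*mlp\\.experts\\..*"],
            "weights": {"num_bits": 4, "type": "float", "symmetric": True,
                        "strategy": "tensor_group", "group_size": 16},
        },
    },
    "ignore": ["lm_head"],
}

for line in summarize_groups(example_quant_cfg):
    print(line)
```

For a real check, load the dictionary with `json.load(open("config.json"))["quantization_config"]` from a local copy of the repository and pass it to `summarize_groups`.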

Compared with the original QuantTrio checkpoint, the routed expert tensors are not INT4 group-size-128 weights anymore. They are replaced by Luke's NVFP4 expert tensors.

Compared with Luke's NVFP4 checkpoint, this checkpoint does not keep the dense and attention parts in the same BF16/NVFP4 ModelOpt layout. Those parts come from QuantTrio's compact W8A16 export.

Notes on NVFP4 expert quality

Luke's NVFP4 checkpoint quantizes directly from the BF16 GLM-5.2 checkpoint using NVIDIA Model Optimizer. In that source checkpoint, only the non-shared MoE expert MLP projections are quantized to NVFP4; attention weights, early dense MLP layers, and shared experts are left unquantized. The calibration uses natural top-k routing rather than forcing all experts active, with broad sample coverage to better match the distributions experts see during inference.

That matters for this hybrid checkpoint because the routed MoE experts are the largest parameter component and the most routing-sensitive part of GLM-5.2. NVFP4 uses small 16-value floating-point blocks with FP8 scale metadata, while the original QuantTrio expert path uses integer 4-bit group quantization with group size 128. The finer scaling granularity is one reason the NVFP4 expert path can preserve the BF16 distribution better in local KLD tests.
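The granularity trade-off can be made concrete with a little arithmetic: smaller groups mean more scale metadata per weight. The sketch below estimates effective bits per weight under stated assumptions. The 8-bit scale for NVFP4 follows the FP8 scale mentioned above; the 16-bit scale for the integer group formats is an assumption, and any per-tensor global scale or zero-point storage is ignored.

```python
def effective_bits(weight_bits: int, group_size: int, scale_bits: int) -> float:
    """Bits per weight including one scale value shared by each group."""
    return weight_bits + scale_bits / group_size


# NVFP4: 4-bit values, one FP8 scale per 16 values.
nvfp4 = effective_bits(weight_bits=4, group_size=16, scale_bits=8)
# INT4 group quantization, group size 128 (16-bit scale is an assumption).
int4_g128 = effective_bits(weight_bits=4, group_size=128, scale_bits=16)
# W8A16 INT8 group quantization, group size 128 (16-bit scale is an assumption).
int8_g128 = effective_bits(weight_bits=8, group_size=128, scale_bits=16)

print(f"NVFP4 g16:  {nvfp4} bits/weight")
print(f"INT4 g128:  {int4_g128} bits/weight")
print(f"INT8 g128:  {int8_g128} bits/weight")
```

Under these assumptions NVFP4 costs 4.5 bits per weight against 4.125 for INT4 with group size 128, so the expert path pays a small storage premium for eight times as many scale values.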

Measured local distribution quality

KLD (Kullback-Leibler divergence) and JS (Jensen-Shannon divergence) are local next-token distribution proxies, not full model-quality benchmarks. They are useful for detecting numerical regressions, but deployment quality should also be checked with long-context tasks, coding prompts, tool calling, repetition/CJK watchdogs, MTP acceptance, throughput, and VRAM.
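For readers unfamiliar with these metrics, here is a minimal sketch of both divergences computed on a single next-token distribution. The logits are made-up stand-ins, natural logarithms are used, and the direction KL(reference || quantized) is an assumption; the card does not state how the local test stack defines or averages them.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


reference = softmax([2.0, 1.0, 0.1])  # stand-in for BF16 logits
quantized = softmax([1.9, 1.1, 0.1])  # stand-in for quantized-model logits

print("KLD:", kl_divergence(reference, quantized))
print("JS: ", js_divergence(reference, quantized))
```

A per-token value like this would then be averaged over many positions to produce a mean such as those reported below.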

Repeated local KLD measurements from the vLLM/B12X test stack showed:

| Checkpoint | Prefill KLD mean | Decode JS mean |
| --- | --- | --- |
| Luke NVFP4 | `0.068257` | `0.00000236` |
| QuantTrio GLM-5.2 Int4-Int8Mix | `0.070448` | `0.00000286` |
| This hybrid, W8A16 + Luke NVFP4 experts | `0.071182` | `0.00000264` |

Interpretation:

- Luke NVFP4 remains the strongest of these practical-size checkpoints in the repeated local distribution test, with the lowest mean on both metrics.
- This hybrid is close to QuantTrio on prefill (mean KLD 0.071182 vs 0.070448, marginally higher) and slightly better on repeated decode JS (0.00000264 vs 0.00000286) in that run set, but the decode differences are small and overlap run-to-run variance.
- Do not treat KLD alone as a final quality ranking. It is one signal.
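To put the gaps in proportion, the reported means can be expressed as relative changes against the QuantTrio checkpoint. This short sketch uses only the numbers from the table; the choice of QuantTrio as the baseline is for illustration.

```python
# Means copied from the table above: (prefill KLD, decode JS).
results = {
    "Luke NVFP4": (0.068257, 0.00000236),
    "QuantTrio Int4-Int8Mix": (0.070448, 0.00000286),
    "Hybrid W8A16 + NVFP4 experts": (0.071182, 0.00000264),
}


def relative_change(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


base_kld, base_js = results["QuantTrio Int4-Int8Mix"]
for name, (kld, js) in results.items():
    print(
        f"{name}: prefill KLD {relative_change(kld, base_kld):+.2f}%, "
        f"decode JS {relative_change(js, base_js):+.2f}%"
    )
```

Relative to QuantTrio, the hybrid's prefill KLD is roughly 1% higher and its decode JS roughly 8% lower, which is why the differences should be read alongside run-to-run variance rather than as a ranking.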

Serving status

This checkpoint was prepared for the local vLLM/B12X GLM-5.2 stack used by local-inference-lab/rtx6kpro. It is not claimed to be a generic drop-in model for every runtime.

Known working class of configuration:

- vLLM with GLM-5.2 support
- `--quantization compressed-tensors`
- `--kv-cache-dtype fp8`
- `--attention-backend B12X_MLA_SPARSE`
- `--moe-backend b12x`
- B12X with A16 expert serving support

Example shape used in local testing:

vllm serve /path/to/GLM-5.2-NVFP4-Int8Mix \
  --served-model-name GLM-5.2 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 1 \
  --quantization compressed-tensors \
  --attention-backend B12X_MLA_SPARSE \
  --moe-backend b12x \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45
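Once a server like the one above is running, it can be queried through the OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch. It assumes the server listens on `http://localhost:8000` (vLLM's default port, since the command sets none) and uses the served model name `GLM-5.2` from the example; `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="GLM-5.2"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
#   print(ask("What is the capital of France?"))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.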

For exact Docker images and launch recipes used in local benchmarking, see the GLM-5.2 v12 notes in:

https://github.com/local-inference-lab/rtx6kpro/blob/master/models/glm5.2_v12.md

File size

Approximate uploaded size: 409.33 GiB.

License

The model card inherits the MIT license metadata from the source GLM-5.2 release and source model cards. Check the upstream model cards for complete license and usage details.
