Instructions for using Mapika/GLM-5.2-NVFP4 with libraries and local apps.

Libraries

How to use Mapika/GLM-5.2-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Mapika/GLM-5.2-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Mapika/GLM-5.2-NVFP4")
model = AutoModelForCausalLM.from_pretrained("Mapika/GLM-5.2-NVFP4", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Mapika/GLM-5.2-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Mapika/GLM-5.2-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mapika/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
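
The same request can also be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are made up for illustration, and it assumes the server is reachable on the host and port from the curl example above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=120):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server, so it is left commented out):
# print(chat("http://localhost:8000", "Mapika/GLM-5.2-NVFP4",
#            "What is the capital of France?"))
```

Pointing `base_url` at `http://localhost:30000` should work the same way for a server started on that port, since the payload is identical.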

Use Docker

docker model run hf.co/Mapika/GLM-5.2-NVFP4

SGLang

How to use Mapika/GLM-5.2-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Mapika/GLM-5.2-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mapika/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Mapika/GLM-5.2-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mapika/GLM-5.2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Mapika/GLM-5.2-NVFP4 with Docker Model Runner:
```
docker model run hf.co/Mapika/GLM-5.2-NVFP4
```

"Fits on 4× ≥80 GB GPUs"

by fraserprice - opened Jun 17

Discussion

fraserprice

Jun 17

"Fits on 4× ≥80 GB GPUs at TP=4 (~110 GB/GPU)"

do I need to download more RAM for my 80GB GPU? 🤔

Mapika

Owner Jun 17

Good catch, that line was just wrong and I have fixed it in the model card. You do not need more RAM :) it is a tensor-parallel thing, not a RAM thing. At --tp 4 the weights are about 110 GB per GPU, which will not fit an 80 GB card. For 80 GB GPUs use --tp 8 instead (about 55 GB of weights per GPU), so 8x H100 or A100-80GB works fine. --tp 4 is meant for cards with 128 GB or more (H200, B200, MI300X). Thanks for flagging it.
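
The arithmetic behind that reply can be sketched in a few lines. This is only a rough model: it assumes the weights split evenly across the tensor-parallel ranks, it ignores KV cache and activation memory, and the 440 GB total is back-derived from the "about 110 GB per GPU at --tp 4" figure rather than measured.

```python
def weights_per_gpu_gb(total_weights_gb, tp_size):
    """Per-GPU weight footprint, assuming an even split across TP ranks."""
    return total_weights_gb / tp_size


def fits(total_weights_gb, tp_size, gpu_memory_gb, headroom_gb=0.0):
    """True if the per-GPU weights plus optional headroom fit in GPU memory."""
    needed = weights_per_gpu_gb(total_weights_gb, tp_size) + headroom_gb
    return needed <= gpu_memory_gb


# Assumption: total size back-derived from ~110 GB per GPU at --tp 4.
TOTAL_WEIGHTS_GB = 110 * 4

for tp in (4, 8):
    per_gpu = weights_per_gpu_gb(TOTAL_WEIGHTS_GB, tp)
    print(f"tp={tp}: {per_gpu:.0f} GB/GPU, fits 80 GB card: "
          f"{fits(TOTAL_WEIGHTS_GB, tp, 80)}")
```

In practice the `headroom_gb` argument matters: a card that only just holds the weights has no room left for the KV cache.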

abishekhtw

Jun 17

Any way to get it running on 7× 95 GB GPUs (i.e. 7× RTX Pro 6000)?

Mapika

Owner Jun 17

7 is an awkward number here :) Tensor parallel has to divide the model evenly, and GLM-5.2's dims are all built on powers of 2 (64 attention heads, 256 experts, 6144 = 3 × 2048 hidden, 2048 MoE intermediate), so only TP=2/4/8 are valid (TP=6 and TP=7 are not, since neither divides 64 or 256). The catch: TP=4 needs about 102 GB of weights per GPU, just over your 96 GB, while TP=8 fits nicely (~51 GB/GPU) but needs 8 GPUs.
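
The divisibility rule is easy to check by hand or in code. The sketch below takes the four dimensions quoted in the reply and reports which tensor-parallel sizes divide all of them. It is a simplification: real serving frameworks may apply additional or different constraints, and the helper names are made up for illustration.

```python
# Dimensions as quoted in the reply above.
DIMS = {
    "attention_heads": 64,
    "experts": 256,
    "hidden": 6144,
    "moe_intermediate": 2048,
}


def blocking_dims(tp_size, dims=DIMS):
    """Names of the dimensions that tp_size does not divide evenly."""
    return [name for name, size in dims.items() if size % tp_size != 0]


def valid_tp_sizes(candidates, dims=DIMS):
    """Candidate TP sizes that divide every dimension evenly."""
    return [tp for tp in candidates if not blocking_dims(tp, dims)]


print(valid_tp_sizes(range(2, 9)))  # [2, 4, 8]
print(blocking_dims(6))  # ['attention_heads', 'experts', 'moe_intermediate']
print(blocking_dims(7))  # all four dimensions
```

Note that 6144 is divisible by 6, so the hidden size alone would not rule out TP=6; the head and expert counts do.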

So two options:

1. Add an 8th RTX Pro 6000 and run --tp 8. Cleanest and fastest.
2. With exactly 7 GPUs, use pipeline parallelism instead: sglang --pp-size 7 --tp-size 1 splits the 78 layers across the cards (~11 layers, ~58 GB each, fits). Pipeline parallel has lower throughput than TP and the PP + DSA + NVFP4 path is less battle-tested, so verify output, but memory-wise it works.
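
To see roughly where the "~11 layers, ~58 GB each" figure comes from, here is a small sketch of an even pipeline split. It makes two assumptions that are not in the thread: every layer is the same size, and the total is about 408 GB (back-derived from "about 102 GB per GPU at TP=4"). The way SGLang actually assigns layers to stages may differ.

```python
def split_layers(num_layers, num_stages):
    """Spread layers over stages as evenly as possible (earlier stages get the extras)."""
    base, extra = divmod(num_layers, num_stages)
    return [base + 1 if i < extra else base for i in range(num_stages)]


def stage_weights_gb(total_weights_gb, num_layers, stages):
    """Approximate weights per stage, assuming all layers are the same size."""
    per_layer = total_weights_gb / num_layers
    return [per_layer * n for n in stages]


# Assumption: total size back-derived from ~102 GB per GPU at TP=4.
TOTAL_WEIGHTS_GB = 102 * 4

stages = split_layers(78, 7)
print(stages)  # [12, 11, 11, 11, 11, 11, 11]
for layers, gb in zip(stages, stage_weights_gb(TOTAL_WEIGHTS_GB, 78, stages)):
    print(f"{layers} layers: about {gb:.0f} GB")
```

Under these assumptions the largest stage stays well below 96 GB, which matches the "memory-wise it works" conclusion.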

Also note RTX Pro 6000 is Blackwell sm_120, while the NVFP4 CUTLASS MoE kernels are mainly tuned for datacenter Blackwell (sm_100/103), so sanity-check generation quality on your cards.
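
One quick way to see which architecture the cards report is to read their CUDA compute capability. This is a sketch, assuming PyTorch with CUDA is installed for the `local_capabilities` helper. The sm_100/103 set comes from the reply above, and the helper names are made up for illustration.

```python
# Compute capabilities the reply names as the main tuning target.
DATACENTER_BLACKWELL = {(10, 0), (10, 3)}


def sm_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def needs_sanity_check(capability):
    """True if the card is outside the sm_100/103 set mentioned in the thread."""
    return tuple(capability) not in DATACENTER_BLACKWELL


def local_capabilities():
    """Compute capability of each visible GPU (requires PyTorch with CUDA)."""
    import torch  # imported lazily so the helpers above work without it

    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


print(sm_name((12, 0)), needs_sanity_check((12, 0)))  # sm_120 True
print(sm_name((10, 0)), needs_sanity_check((10, 0)))  # sm_100 False
```

A `True` here does not mean the output is wrong, only that it is worth comparing a few generations against a known-good reference.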

abishekhtw

Jun 17

Thanks for the detailed response. I've run into similar issues with other models as well. I have experimented a bit with pipeline parallelism, but I kept hitting roadblocks and haven't had much time to dig deeper. I'll give this approach a try and see how it goes, but if it ends up reducing throughput, I'll need to compare it against the output from the upcoming GGUFs.
Adding one more GPU would probably make things a lot easier.
