Instructions to use Mapika/GLM-5.2-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Mapika/GLM-5.2-NVFP4 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="Mapika/GLM-5.2-NVFP4")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForMultimodalLM

    tokenizer = AutoTokenizer.from_pretrained("Mapika/GLM-5.2-NVFP4")
    model = AutoModelForMultimodalLM.from_pretrained("Mapika/GLM-5.2-NVFP4")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- TensorRT
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Mapika/GLM-5.2-NVFP4 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "Mapika/GLM-5.2-NVFP4"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "Mapika/GLM-5.2-NVFP4",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
docker model run hf.co/Mapika/GLM-5.2-NVFP4
- SGLang
How to use Mapika/GLM-5.2-NVFP4 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "Mapika/GLM-5.2-NVFP4" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "Mapika/GLM-5.2-NVFP4",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "Mapika/GLM-5.2-NVFP4" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "Mapika/GLM-5.2-NVFP4",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use Mapika/GLM-5.2-NVFP4 with Docker Model Runner:
docker model run hf.co/Mapika/GLM-5.2-NVFP4
"Fits on 4× ≥80 GB GPUs"
"Fits on 4× ≥80 GB GPUs at TP=4 (~110 GB/GPU)"
do I need to download more RAM for my 80GB GPU? 🤔
Good catch, that line was just wrong and I have fixed it in the model card. You do not need more RAM :) it is a tensor-parallel thing, not a RAM thing. At --tp 4 the weights are about 110 GB per GPU, which will not fit an 80 GB card. For 80 GB GPUs use --tp 8 instead (about 55 GB of weights per GPU), so 8x H100 or A100-80GB works fine. --tp 4 is meant for cards with 128 GB or more (H200, B200, MI300X). Thanks for flagging it.
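For anyone who wants to redo the arithmetic, here is a rough sketch of the per-GPU weight footprint at different tensor-parallel degrees. It assumes the weights shard evenly across GPUs and infers the total (about 440 GB) from the "110 GB per GPU at --tp 4" figure above; the function names are just for illustration.

```python
# Rough per-GPU weight footprint under tensor parallelism.
# Assumption: weights shard evenly, so per-GPU size = total / tp.
# Assumption: total is inferred from "about 110 GB per GPU at --tp 4".
TOTAL_WEIGHTS_GB = 4 * 110


def weights_per_gpu(tp: int, total_gb: float = TOTAL_WEIGHTS_GB) -> float:
    """Approximate GB of weights held by each GPU at the given TP degree."""
    return total_gb / tp


def weights_fit(tp: int, gpu_memory_gb: float,
                total_gb: float = TOTAL_WEIGHTS_GB) -> bool:
    """True if the weight shard alone is smaller than the GPU's memory."""
    return weights_per_gpu(tp, total_gb) < gpu_memory_gb


for tp, gpu_gb in [(4, 80), (8, 80), (4, 128)]:
    shard = weights_per_gpu(tp)
    verdict = "fits" if weights_fit(tp, gpu_gb) else "does not fit"
    print(f"tp={tp}: ~{shard:.0f} GB of weights per GPU, {verdict} in {gpu_gb} GB")
```

This only counts weights. KV cache and activations need headroom on top, so a pass here is necessary rather than sufficient.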
Any way to get it running on 7× 95 GB GPUs (i.e. 7× RTX Pro 6000)?
7 is an awkward number here :) Tensor parallel has to divide the model evenly, and the GLM-5.2 dims that matter here are powers of 2 (64 attention heads, 256 experts, 2048 MoE intermediate; the 6144 hidden size is 3 × 2048, but the head and expert counts still rule out 6), so only TP=2/4/8 are valid (TP=6 and TP=7 are not). The catch: TP=4 needs about 102 GB of weights per GPU, just over your 96 GB, while TP=8 fits nicely (~51 GB/GPU) but needs 8 GPUs.
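The divisibility argument can be checked mechanically. The sketch below uses the dimensions quoted in this reply and treats a TP degree as valid only if it divides every one of them; that rule is a simplification for illustration, and a real serving framework may impose further constraints.

```python
# Dimensions as quoted in the reply above.
SHARDED_DIMS = {
    "attention_heads": 64,
    "experts": 256,
    "hidden": 6144,            # 3 * 2048, so on its own it would allow 6
    "moe_intermediate": 2048,
}


def valid_tp(tp: int, dims: dict = SHARDED_DIMS) -> bool:
    """Simplified rule (an assumption): tp must divide every listed dim."""
    return all(size % tp == 0 for size in dims.values())


def blocking_dims(tp: int, dims: dict = SHARDED_DIMS) -> list:
    """Names of the dims that tp does not divide evenly."""
    return [name for name, size in dims.items() if size % tp != 0]


valid = [tp for tp in range(2, 9) if valid_tp(tp)]
print("valid TP degrees up to 8:", valid)  # [2, 4, 8]
print("TP=6 blocked by:", blocking_dims(6))
print("TP=7 blocked by:", blocking_dims(7))
```

With 7 cards, that leaves TP=4 (using only four of them) and TP=8 (one card short) as the only tensor-parallel candidates.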
So two options:
- Add an 8th RTX Pro 6000 and run --tp 8. Cleanest and fastest.
- With exactly 7 GPUs, use pipeline parallelism instead: sglang --pp-size 7 --tp-size 1 splits the 78 layers across the cards (~11 layers, ~58 GB each, fits). Pipeline parallel has lower throughput than TP and the PP + DSA + NVFP4 path is less battle-tested, so verify output, but memory-wise it works.
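To see where the "~11 layers, ~58 GB each" estimate for the 7-GPU option comes from, here is a small sketch of an even layer split. It assumes every layer weighs the same and takes the total from the ~51 GB/GPU at TP=8 figure above (8 × 51 = 408 GB); the actual partitioning a framework picks, and the real per-layer sizes, may differ.

```python
NUM_LAYERS = 78
NUM_STAGES = 7
TOTAL_WEIGHTS_GB = 8 * 51   # assumption: inferred from ~51 GB/GPU at TP=8
GPU_MEMORY_GB = 96          # RTX Pro 6000 figure used in the reply


def split_layers(num_layers: int, num_stages: int) -> list:
    """Spread layers as evenly as possible; early stages take the remainder."""
    base, extra = divmod(num_layers, num_stages)
    return [base + 1 if i < extra else base for i in range(num_stages)]


stages = split_layers(NUM_LAYERS, NUM_STAGES)
# Assumption: all layers are the same size (ignores embeddings and any
# difference between dense and MoE layers).
gb_per_layer = TOTAL_WEIGHTS_GB / NUM_LAYERS
stage_gb = [n * gb_per_layer for n in stages]

for i, (n, gb) in enumerate(zip(stages, stage_gb)):
    print(f"stage {i}: {n} layers, ~{gb:.1f} GB of weights")
```

Because 78 is not a multiple of 7, one stage ends up with 12 layers instead of 11, so its share is a few GB above the average but still under the card's memory in this estimate.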
Also note RTX Pro 6000 is Blackwell sm_120, while the NVFP4 cutlass MoE kernels are mainly tuned for datacenter Blackwell (sm_100/103), so sanity-check generation quality on your cards.
Thanks for the detailed response. I’ve run into similar issues with other models as well. Have experimented a bit with pipeline parallelism, but I kept hitting roadblocks and haven’t had much time to dig deeper. I’ll give this approach a try and see how it goes, but if it ends up reducing throughput, I’ll need to compare it against the output from the upcoming GGUFs.
Adding one more GPU would probably make things a lot easier.