GadflyII/GLM-4.7-Flash-NVFP4
What version of transformers are you running?
5.0
Try with:
--gpu-memory-utilization 0.85
Also, what did you set --max-model-len at?
Those are OOMs, not the maintainer's fault. Here's a fairly memory-constrained config to try. If it works, try removing the swap space, then increase the max model len little by little.
Also set the tensor parallel size to the number of cards you have. The config below is how I got the native model to run on my 2x5090 machine.
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 8000 \
--trust-remote-code \
--max-num-seqs 1 \
--gpu-memory-utilization 0.96 \
--swap-space 16 \
--enforce-eager \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
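If you are not sure how many cards the machine exposes, a small sketch like the one below can derive the value for --tensor-parallel-size. It assumes nvidia-smi is on the PATH and prints one "GPU N: ..." line per device with -L; count_gpus is a hypothetical helper name, not part of vLLM.

```shell
# Count devices from `nvidia-smi -L` style output read on stdin
# (one "GPU N: ..." line per card).
count_gpus() {
  grep -c '^GPU [0-9]'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  TP_SIZE="$(nvidia-smi -L | count_gpus)"
  echo "--tensor-parallel-size ${TP_SIZE}"
fi
```

This only counts devices. It does not check that each card has enough free memory for its share of the model.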
The following is what I use for this quant:
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 80000 \
--trust-remote-code \
--max-num-seqs 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
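Once the server is up, a quick request against the OpenAI-compatible endpoint confirms that the served model name and port match the config. This is only a sketch: chat_payload and smoke_test are hypothetical helper names, and it assumes the server is reachable on localhost:8000 under the name glm-4.7-flash as in the commands above.

```shell
# Build a minimal chat-completions JSON body.
# $1 = served model name, $2 = user prompt.
# The prompt must not contain double quotes or backslashes (no JSON escaping here).
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}' "$1" "$2"
}

# Send one request to the local server; prints the raw JSON response.
smoke_test() {
  curl -sS "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data "$(chat_payload "glm-4.7-flash" "${1:-Who are you?}")"
}
```

Run smoke_test (optionally with a prompt as the first argument) after the server reports that it is ready. A connection error usually just means the model is still loading.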
Also, what did you set --max-model-len at?
4096
I tried, and it still doesn't work.
Do you have 1 GPU or 2? Use --tensor-parallel-size 1 for a single GPU. Are you sure that nothing else is using your GPU's memory?
# Decrease --gpu-memory-utilization if you get OOMs.
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
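To see whether something else is already holding GPU memory before you launch, a sketch like this may help. It assumes nvidia-smi is on the PATH and reports numeric used/total values in MiB; gpu_mem_summary is a hypothetical helper name, not part of vLLM.

```shell
# Summarize "used, total" MiB pairs (one line per GPU) read from stdin.
# Lines without a positive numeric total (for example "[N/A]") are skipped.
gpu_mem_summary() {
  awk -F', *' '$2 > 0 { printf "GPU %d: %d/%d MiB used (%.0f%%)\n", NR-1, $1, $2, 100*$1/$2 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits \
    | gpu_mem_summary
fi
```

If a card already shows a large share in use before vLLM starts, lowering --gpu-memory-utilization or stopping the other process is the first thing to try.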
Any tips for Docker compose on Nvidia Spark vLLM?
I first created an image with Transformers 5, then referenced that image in the Docker compose file.
services:
  vllm-node:
    image: vllm-transformers5
    container_name: vllm-io
    environment:
      - VLLM_API_SERVER_COUNT=2
    restart: unless-stopped
    # Networking and Privileges
    privileged: true
    network_mode: host
    ipc: host
    pid: host
    # GPU Access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Command: Keeps your bash wrapper to ensure environment variables load
    command: >
      bash -c -i "vllm serve
      GadflyII/GLM-4.7-Flash-NVFP4
      --port 8000 --host 0.0.0.0
      --gpu-memory-utilization 0.7
      --load-format fastsafetensors"
Then I get the error `usage: vllm serve [model_tag] [options]
vllm serve: error: argument --compilation-config/-cc: expected one argument`
I should mention that I tried adding the --cc argument, passing a blank JSON value to --cc, and passing one of the default values (e.g. mode 3), but each attempt gives the same error message.
Maybe give this a try: https://github.com/eugr/spark-vllm-docker
To add to the previous comment: if you use https://github.com/eugr/spark-vllm-docker with DGX Spark, make sure you build with the --pre-tf flag so that the image includes Transformers 5.
To run this model (I just tested it):
Build:
./build-and-copy.sh \
-t vllm-node-20260122-whl-tf5 \
--use-wheels --pre-tf --pre-flashinfer \
--rebuild-vllm --rebuild-deps
vllm serve arguments:
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--load-format fastsafetensors \
--gpu-memory-utilization 0.7 \
--max-model-len 32768 \
--host 0.0.0.0 --port 8888
One note: NVFP4 performance in vLLM on Spark is not great currently. You will get much better performance from AWQ quants or even FP8!



