GadflyII/GLM-4.7-Flash-NVFP4

#3
by Yu21342 - opened

WSL2 + 5090 + Python 3.11 — it doesn't work.

[screenshots of the errors]

What version of transformers are you running?

5.0


try with:

--gpu-memory-utilization 0.85

Also, what did you set --max-model-len at?

Those are OOMs, not the maintainer's fault. Here's a pretty memory-constrained config to try. If this works, try removing the swap space, then increase the max model len little by little (see the back-of-envelope sketch after the config below).

Also, set --tensor-parallel-size to however many cards you have. The below is how I got the native model to run on my 2x5090 machine.

export PYTORCH_ALLOC_CONF=expandable_segments:True

uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 8000 \
  --trust-remote-code \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.96 \
  --swap-space 16 \
  --enforce-eager \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
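
For a rough sense of why max-model-len matters: KV-cache memory grows linearly with context length. Here's a back-of-envelope sketch in shell; the layer/head/dim numbers are illustrative placeholders, not the real GLM-4.7-Flash architecture (check the model's config.json for actual values):

# KV cache per token ~= 2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_element
# With --kv-cache-dtype fp8, bytes_per_element = 1. Placeholder values below.
echo $(( 2 * 48 * 8 * 128 * 1 ))         # ~96 KiB per token
echo $(( 2 * 48 * 8 * 128 * 1 * 8000 ))  # ~750 MiB for an 8000-token context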

The following is what I use for this quant:

export PYTORCH_ALLOC_CONF=expandable_segments:True

uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 80000 \
  --trust-remote-code \
  --max-num-seqs 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
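
Once the server is up, a quick sanity check against vLLM's OpenAI-compatible endpoint (model name and port taken from the serve command above; adjust if you changed them):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'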

Also, what did you set --max-model-len at?

4096


I tried it; it still doesn't work.


Do you have 1 GPU or 2? Use `--tensor-parallel-size 1` for a single GPU. Are you sure nothing else is using your GPU's memory? (You can check with nvidia-smi; see the snippet after the config below.)

# decrease --gpu-memory-utilization if you get OOMs
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
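
To see what's holding VRAM before launching, you can list GPU processes (assuming the standard nvidia-smi CLI; note that under WSL2 it may not show all Windows-side processes):

# list every process currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv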

Any tips for running vLLM via Docker Compose on an NVIDIA Spark?
I first created an image with Transformers 5, then referenced that image in the Docker Compose file.

services:
  vllm-node:
    image: vllm-transformers5
    container_name: vllm-io
    environment: 
      - VLLM_API_SERVER_COUNT=2
    restart: unless-stopped
    
    # Networking and Privileges
    privileged: true
    network_mode: host
    ipc: host
    pid: host

    # GPU Access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # Command: Keeps your bash wrapper to ensure environment variables load
    command: >
      bash -c -i "vllm serve 
      GadflyII/GLM-4.7-Flash-NVFP4 
      --port 8000 --host 0.0.0.0 
      --gpu-memory-utilization 0.7 
      --load-format fastsafetensors"

Then I get the error `usage: vllm serve [model_tag] [options]

vllm serve: error: argument --compilation-config/-cc: expected one argument`

I should mention that I tried adding the --cc argument explicitly, passing an empty JSON object to --cc, and one of the default values (e.g. mode 3), but it gives the same error message.

ehh... not sure about that one, try:

docker run --rm -it --gpus all vllm-transformers5 bash

Then manually run:

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 --load-format fastsafetensors
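
If the manual run inside the container works but compose still fails, one way to rule out YAML-folding/quoting problems is to temporarily echo the command exactly as the shell receives it (a sketch against the same compose file, using the vllm-node service name from above):

# override the service command with an echo to see what bash actually gets
docker compose run --rm vllm-node \
  bash -c "echo vllm serve GadflyII/GLM-4.7-Flash-NVFP4 --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 --load-format fastsafetensors"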


maybe give it a try here: https://github.com/eugr/spark-vllm-docker

To add to the previous comment: if using https://github.com/eugr/spark-vllm-docker with a DGX Spark, make sure you build with the --pre-tf flag so the image includes Transformers 5. This is what worked for me to run this model (just tested it).

Build:

./build-and-copy.sh \
  -t vllm-node-20260122-whl-tf5 \
  --use-wheels --pre-tf --pre-flashinfer \
  --rebuild-vllm --rebuild-deps

vllm serve arguments:

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --gpu-memory-utilization 0.7 \
  --max-model-len 32768 \
  --host 0.0.0.0 --port 8888
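
Once it's running, you can confirm the server is up and see the served model name via the standard OpenAI-compatible models endpoint (port 8888 per the command above):

curl http://localhost:8888/v1/models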

One note - NVFP4 performance in vLLM on Spark is not great currently. You will get much better performance from AWQ quants or even FP8!

GadflyII changed discussion status to closed
