GadflyII/GLM-4.7-Flash-NVFP4
What version of transformers are you running?
> What version of transformers are you running?
5.0
Try with:
--gpu-memory-utilization 0.85
Also, what did you set --max-model-len to?
Those are OOMs, not the maintainer's fault. Here's a fairly memory-constrained config to try. If it works, try removing the swap space, then increase the max model len little by little.
Also, set the tensor parallel size to the number of cards you have. The below is how I got the native model to run on my 2x5090 machine.
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 8000 \
--trust-remote-code \
--max-num-seqs 1 \
--gpu-memory-utilization 0.96 \
--swap-space 16 \
--enforce-eager \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
The following is what I use for this quant:
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 80000 \
--trust-remote-code \
--max-num-seqs 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
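Once the server is up, a quick way to smoke-test it is to hit the OpenAI-compatible chat endpoint. A minimal sketch using only the Python standard library (the model name and port match the serve command above; adjust if yours differ):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "glm-4.7-flash",
                       max_tokens: int = 64) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the request to the vLLM server and return the reply text."""
    data = json.dumps(build_chat_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
#   print(chat("Say hello in one sentence."))
```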
> Also, what did you set --max-model-len to?
4096
I tried, but it still doesn't work.
Do you have 1 GPU or 2? Use "--tensor-parallel-size 1" for a single GPU. Are you sure that nothing else is using your GPUs' memory?
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000

Decrease --gpu-memory-utilization if you still get OOMs.
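To check whether something else is already holding GPU memory, nvidia-smi is the usual tool. A small sketch that shells out to it and parses the per-GPU used/total figures (assumes nvidia-smi is on PATH; the query flags used are standard nvidia-smi options):

```python
import subprocess

def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'memory.used, memory.total' CSV lines (MiB) from nvidia-smi."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x.strip()) for x in line.split(","))
        gpus.append((used, total))
    return gpus

def gpu_memory() -> list[tuple[int, int]]:
    """Return (used MiB, total MiB) per GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)

# Example:
#   for i, (used, total) in enumerate(gpu_memory()):
#       print(f"GPU {i}: {used} / {total} MiB in use")
```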
Any tips for Docker compose on Nvidia Spark vLLM?
I first created an image with Transformers 5, then referenced that image in the Docker compose file.
services:
  vllm-node:
    image: vllm-transformers5
    container_name: vllm-io
    environment:
      - VLLM_API_SERVER_COUNT=2
    restart: unless-stopped
    # Networking and privileges
    privileged: true
    network_mode: host
    ipc: host
    pid: host
    # GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Command: keeps your bash wrapper to ensure environment variables load
    command: >
      bash -c -i "vllm serve
      GadflyII/GLM-4.7-Flash-NVFP4
      --port 8000 --host 0.0.0.0
      --gpu-memory-utilization 0.7
      --load-format fastsafetensors"
Then I get the error `usage: vllm serve [model_tag] [options]
vllm serve: error: argument --compilation-config/-cc: expected one argument`
I should mention that I tried adding the --cc argument, as well as a blank JSON argument to --cc, as well as one of the default values (e.g. mode 3), etc., but it gives the same error message.
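As a side note, one way to take shell quoting out of the picture is to write the compose command in exec (list) form instead of a bash -c string; each list item is passed through as one argument verbatim. A sketch of just the command key, under the same assumptions as the compose file above:

```yaml
# Exec-form command: no shell involved, so multi-word values
# cannot be re-split or swallowed by quoting rules.
command:
  - vllm
  - serve
  - GadflyII/GLM-4.7-Flash-NVFP4
  - --port=8000
  - --host=0.0.0.0
  - --gpu-memory-utilization=0.7
  - --load-format=fastsafetensors
```

Note that exec form skips the interactive bash wrapper, so variables from shell rc files won't be loaded; set those via the environment: key instead.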
Maybe give it a try here: https://github.com/eugr/spark-vllm-docker
To add to the previous comment: if using https://github.com/eugr/spark-vllm-docker with DGX Spark, make sure you build with the --pre-tf flag so it includes Transformers 5.
To run this model (just tested it):
Build:
./build-and-copy.sh \
-t vllm-node-20260122-whl-tf5 \
--use-wheels --pre-tf --pre-flashinfer \
--rebuild-vllm --rebuild-deps
vllm serve arguments:
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--load-format fastsafetensors \
--gpu-memory-utilization 0.7 \
--max-model-len 32768 \
--host 0.0.0.0 --port 8888
One note - NVFP4 performance in vLLM on Spark is not great currently. You will get much better performance from AWQ quants or even FP8!
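To compare quants concretely, you can measure decode throughput from the server-reported usage stats. A rough sketch against the OpenAI-compatible endpoint (port 8888 and the full repo id as the model name, matching the serve command above since it sets no --served-model-name; the helper math is the only part that doesn't need a running server):

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput from server-reported usage stats."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def benchmark(prompt: str, base_url: str = "http://localhost:8888/v1",
              model: str = "GadflyII/GLM-4.7-Flash-NVFP4",
              max_tokens: int = 256) -> float:
    """Time one chat completion and return decode tokens/sec."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    elapsed = time.monotonic() - start
    return tokens_per_second(usage["completion_tokens"], elapsed)

# With the server running, compare quants by swapping the served model:
#   print(f"{benchmark('Write a haiku about GPUs.'):.1f} tok/s")
```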