GadflyII/GLM-4.7-Flash-NVFP4

#3
by Yu21342 - opened

WSL2 + 5090 + Python 3.11 — it doesn't work.

[screenshots of the errors]

What version of transformers are you running?

5.0


try with:

--gpu-memory-utilization 0.85

Also, what did you set --max-model-len at?

Those are OOMs, not the maintainer's fault. Here's a pretty memory-constrained config to try. If this works, try removing the swap space, then increase the max model len little by little (see the back-of-envelope sketch after the config below).

Also, set --tensor-parallel-size to however many cards you have. The below is how I got the native model to run on my 2x5090 machine.

export PYTORCH_ALLOC_CONF=expandable_segments:True

uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 8000 \
  --trust-remote-code \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.96 \
  --swap-space 16 \
  --enforce-eager \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
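
For a rough sense of why max-model-len matters: KV-cache memory grows linearly with context length. Here's a back-of-envelope sketch in shell; the layer/head/dim numbers are illustrative placeholders, not the real GLM-4.7-Flash architecture (check the model's config.json for actual values):

# KV cache per token ~= 2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_element
# With --kv-cache-dtype fp8, bytes_per_element = 1. Placeholder values below.
echo $(( 2 * 48 * 8 * 128 * 1 ))         # ~96 KiB per token
echo $(( 2 * 48 * 8 * 128 * 1 * 8000 ))  # ~750 MiB for an 8000-token context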

The following is what I use for this quant:

export PYTORCH_ALLOC_CONF=expandable_segments:True

uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 80000 \
  --trust-remote-code \
  --max-num-seqs 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
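
Once the server is up, a quick sanity check against vLLM's OpenAI-compatible endpoint (model name and port taken from the serve command above; adjust if you changed them):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'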

Also, what did you set --max-model-len at?

4096


I tried it; it still doesn't work.


Do you have 1 GPU or 2? Use `--tensor-parallel-size 1` for a single GPU. Are you sure nothing else is using your GPU's memory? (You can check with nvidia-smi; see the snippet after the config below.)

# decrease --gpu-memory-utilization if you get OOMs
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --download-dir /mnt/models/llm \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --host 0.0.0.0 --port 8000
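
To see what's holding VRAM before launching, you can list GPU processes (assuming the standard nvidia-smi CLI; note that under WSL2 it may not show all Windows-side processes):

# list every process currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv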

Any tips for running vLLM via Docker Compose on an NVIDIA Spark?
I first created an image with Transformers 5, then referenced that image in the Docker Compose file.

services:
  vllm-node:
    image: vllm-transformers5
    container_name: vllm-io
    environment: 
      - VLLM_API_SERVER_COUNT=2
    restart: unless-stopped
    
    # Networking and Privileges
    privileged: true
    network_mode: host
    ipc: host
    pid: host

    # GPU Access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # Command: Keeps your bash wrapper to ensure environment variables load
    command: >
      bash -c -i "vllm serve 
      GadflyII/GLM-4.7-Flash-NVFP4 
      --port 8000 --host 0.0.0.0 
      --gpu-memory-utilization 0.7 
      --load-format fastsafetensors"

Then I get the error `usage: vllm serve [model_tag] [options]

vllm serve: error: argument --compilation-config/-cc: expected one argument`

I should mention that I tried adding the --cc argument explicitly, passing an empty JSON object to --cc, and one of the default values (e.g. mode 3), but it gives the same error message.

ehh... not sure about that one, try:

docker run --rm -it --gpus all vllm-transformers5 bash

Then manually run:

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 --load-format fastsafetensors
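
If the manual run inside the container works but compose still fails, one way to rule out YAML-folding/quoting problems is to temporarily echo the command exactly as the shell receives it (a sketch against the same compose file, using the vllm-node service name from above):

# override the service command with an echo to see what bash actually gets
docker compose run --rm vllm-node \
  bash -c "echo vllm serve GadflyII/GLM-4.7-Flash-NVFP4 --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 --load-format fastsafetensors"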


maybe give it a try here: https://github.com/eugr/spark-vllm-docker

To add to the previous comment: if using https://github.com/eugr/spark-vllm-docker with a DGX Spark, make sure you build with the --pre-tf flag so the image includes Transformers 5. This is what worked for me to run this model (just tested it).

Build:

./build-and-copy.sh \
  -t vllm-node-20260122-whl-tf5 \
  --use-wheels --pre-tf --pre-flashinfer \
  --rebuild-vllm --rebuild-deps

vllm serve arguments:

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --gpu-memory-utilization 0.7 \
  --max-model-len 32768 \
  --host 0.0.0.0 --port 8888
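
Once it's running, you can confirm the server is up and see the served model name via the standard OpenAI-compatible models endpoint (port 8888 per the command above):

curl http://localhost:8888/v1/models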

One note - NVFP4 performance in vLLM on Spark is not great currently. You will get much better performance from AWQ quants or even FP8!

GadflyII changed discussion status to closed
