GadflyII/GLM-4.7-Flash-NVFP4
#3
by Yu21342 - opened
What version of transformers are you running?
5.0
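If you're unsure, a quick way to check from the shell (run it in the same environment you launch vllm from):
python -c "import transformers; print(transformers.__version__)"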
Try with:
--gpu-memory-utilization 0.85
Also, what did you set --max-model-len to?
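Something like this is a reasonable starting point (a minimal sketch, not a confirmed working config; the 8192 context length is just an illustrative value to tune from):
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192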
Those are OOMs, not the maintainer's fault. Here's a fairly memory-constrained config to try. If it works, try removing the swap space, then increase the max model len little by little.
Also, set --tensor-parallel-size to the number of cards you have. The config below is how I got the native model running on my 2x5090 machine.
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 8000 \
--trust-remote-code \
--max-num-seqs 1 \
--gpu-memory-utilization 0.96 \
--swap-space 16 \
--enforce-eager \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
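Once it comes up, a quick sanity check against vLLM's OpenAI-compatible API (assuming the default host/port above) is to list the served models:
curl http://localhost:8000/v1/models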
The following is what I use for this quant:
export PYTORCH_ALLOC_CONF=expandable_segments:True
uv run vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
--download-dir /mnt/models/llm \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 80000 \
--trust-remote-code \
--max-num-seqs 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash \
--host 0.0.0.0 --port 8000
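For reference, a minimal chat request against the served model name above (the prompt is just a placeholder):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "Hello"}]}'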