Cannot run vLLM on DGX Spark: ImportError: libcudart.so.12

#18
by yyg201708 - opened
vllm serve /home/sie/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash \
  --gpu-memory-utilization 0.8 \
  --served-model-name GB10 \
  --host 0.0.0.0 \
  --port 80 \
  --max-model-len 162144
Traceback (most recent call last):
  File "/home/sie/Downloads/models/glm_47_flash/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 16, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/config/__init__.py", line 5, in <module>
    from vllm.config.cache import CacheConfig
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/config/cache.py", line 14, in <module>
    from vllm.utils.mem_utils import format_gib, get_cpu_memory
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/utils/mem_utils.py", line 14, in <module>
    from vllm.platforms import current_platform
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/platforms/__init__.py", line 257, in __getattr__
    _current_platform = resolve_obj_by_qualname(platform_cls_qualname)()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sie/Downloads/models/glm_47_flash/lib/python3.12/site-packages/vllm/platforms/cuda.py", line 16, in <module>
    import vllm._C  # noqa
    ^^^^^^^^^^^^^^
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
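For anyone hitting this: the error means the dynamic linker cannot find the CUDA 12 runtime library that vLLM's compiled extension was built against. A quick way to check what your environment can actually load (a diagnostic sketch, not part of the original report):

```python
import ctypes

def can_load(libname: str) -> bool:
    """Return True if the dynamic linker can open the shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

# vllm._C here was built against the CUDA 12 runtime specifically,
# so libcudart.so.12 must be resolvable even on a CUDA 13 driver.
print("libcudart.so.12 loadable:", can_load("libcudart.so.12"))
print("libcudart.so.13 loadable:", can_load("libcudart.so.13"))
```

If `libcudart.so.12` comes back unloadable while the driver reports CUDA 13 (as in the `nvidia-smi` output below), the fixes later in this thread apply.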
(glm_47_flash) sie@edgexpert-0882:~/Downloads/models$ nvidia-smi
Tue Jan 20 14:10:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   39C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2765      G   /usr/lib/xorg/Xorg                       18MiB |
|    0   N/A  N/A            3003      G   /usr/bin/gnome-shell                      6MiB |
+-----------------------------------------------------------------------------------------+
uv pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
Using Python 3.12.3 environment at: glm_47_flash
Resolved 152 packages in 1m 42s
Prepared 152 packages in 6m 52s
Installed 152 packages in 98ms
 + aiohappyeyeballs==2.6.1
 + aiohttp==3.13.3
 + aiosignal==1.4.0
 + annotated-doc==0.0.4
 + annotated-types==0.7.0
 + anthropic==0.76.0
 + anyio==4.12.1
 + apache-tvm-ffi==0.1.8.post2
 + astor==0.8.1
 + attrs==25.4.0
 + blake3==1.0.8
 + cachetools==6.2.4
 + cbor2==5.8.0
 + certifi==2026.1.4
 + cffi==2.0.0
 + charset-normalizer==3.4.4
 + click==8.3.1
 + cloudpickle==3.1.2
 + compressed-tensors==0.13.0
 + cryptography==46.0.3
 + cuda-bindings==13.1.1
 + cuda-pathfinder==1.3.3
 + cuda-python==13.1.1
 + cupy-cuda12x==13.6.0
 + depyf==0.20.0
 + dill==0.4.1
 + diskcache==5.6.3
 + distro==1.9.0
 + dnspython==2.8.0
 + docstring-parser==0.17.0
 + einops==0.8.1
 + email-validator==2.3.0
 + fastapi==0.128.0
 + fastapi-cli==0.0.20
 + fastapi-cloud-cli==0.11.0
 + fastar==0.8.0
 + fastrlock==0.8.3
 + filelock==3.20.3
 + flashinfer-python==0.5.3
 + frozenlist==1.8.0
 + fsspec==2026.1.0
 + gguf==0.17.1
 + grpcio==1.78.0rc2
 + grpcio-reflection==1.78.0rc2
 + h11==0.16.0
 + hf-xet==1.2.1rc0
 + httpcore==1.0.9
 + httptools==0.7.1
 + httpx==0.28.1
 + httpx-sse==0.4.3
 + huggingface-hub==0.36.0
 + idna==3.11
 + ijson==3.4.0.post0
 + importlib-metadata==8.7.1
 + interegular==0.3.3
 + jinja2==3.1.6
 + jiter==0.12.0
 + jmespath==1.0.1
 + jsonschema==4.26.0
 + jsonschema-specifications==2025.9.1
 + lark==1.2.2
 + llguidance==1.3.0
 + llvmlite==0.44.0
 + lm-format-enforcer==0.11.3
 + loguru==0.7.3
 + markdown-it-py==4.0.0
 + markupsafe==3.0.3
 + mcp==1.25.0
 + mdurl==0.1.2
 + mistral-common==1.8.8
 + model-hosting-container-standards==0.1.13
 + mpmath==1.3.0
 + msgpack==1.1.2
 + msgspec==0.20.0
 + multidict==6.7.0
 + networkx==3.6.1
 + ninja==1.13.0
 + numba==0.61.2
 + numpy==2.2.6
 + nvidia-cudnn-frontend==1.17.0
 + nvidia-cutlass-dsl==4.3.5
 + nvidia-ml-py==13.590.44
 + openai==2.15.0
 + openai-harmony==0.0.8
 + opencv-python-headless==4.13.0.90
 + opentelemetry-api==1.39.1
 + opentelemetry-sdk==1.39.1
 + opentelemetry-semantic-conventions==0.60b1
 + outlines-core==0.2.11
 + packaging==26.0rc3
 + partial-json-parser==0.2.1.1.post7
 + pillow==12.1.0
 + prometheus-client==0.24.1
 + prometheus-fastapi-instrumentator==7.1.0
 + propcache==0.4.1
 + protobuf==6.33.4
 + psutil==7.2.1
 + py-cpuinfo==9.0.0
 + pybase64==1.4.3
 + pycountry==24.6.1
 + pycparser==2.23
 + pydantic==2.12.5
 + pydantic-core==2.41.5
 + pydantic-extra-types==2.11.0
 + pydantic-settings==2.12.0
 + pygments==2.19.2
 + pyjwt==2.10.1
 + python-dotenv==1.2.1
 + python-json-logger==4.0.0
 + python-multipart==0.0.21
 + pyyaml==6.0.3
 + pyzmq==27.1.0
 + ray==2.53.0
 + referencing==0.37.0
 + regex==2026.1.15
 + requests==2.32.5
 + rich==14.2.0
 + rich-toolkit==0.17.1
 + rignore==0.7.6
 + rpds-py==0.30.0
 + safetensors==0.7.0
 + sentencepiece==0.2.1
 + sentry-sdk==3.0.0a7
 + setproctitle==1.3.7
 + setuptools==80.9.0
 + shellingham==1.5.4
 + six==1.17.0
 + sniffio==1.3.1
 + sse-starlette==3.2.0
 + starlette==0.50.0
 + supervisor==4.3.0
 + sympy==1.14.0
 + tabulate==0.9.0
 + tiktoken==0.12.0
 + tokenizers==0.22.2
 + torch==2.9.1
 + torchaudio==2.9.1
 + torchvision==0.24.1
 + tqdm==4.67.1
 + transformers==4.57.6
 + typer==0.21.1
 + typing-extensions==4.15.0
 + typing-inspection==0.4.2
 + urllib3==2.6.3
 + uvicorn==0.40.0
 + uvloop==0.22.1
 + vllm==0.14.0rc2.dev163+g4753f3bf6
 + watchfiles==1.1.1
 + websockets==16.0
 + xgrammar==0.1.29
 + yarl==1.22.0
 + zipp==3.23.0

(glm_47_flash) sie@edgexpert-0882:~/Downloads/models$ uv pip install git+https://github.com/huggingface/transformers.git
Using Python 3.12.3 environment at: glm_47_flash
    Updated https://github.com/huggingface/transformers.git (314f10929a2215b74c2ad6ecf7b2f380c9b7468a)
Resolved 25 packages in 3m 35s
      Built transformers @ git+https://github.com/huggingface/transformers.git@314f10929a2215b74c2ad6ecf7b2f380c9b7468a
Prepared 3 packages in 1.37s
Uninstalled 2 packages in 28ms
Installed 3 packages in 33ms
 - huggingface-hub==0.36.0
 + huggingface-hub==1.3.2
 - transformers==4.57.6
 + transformers==5.0.0.dev0 (from git+https://github.com/huggingface/transformers.git@314f10929a2215b74c2ad6ecf7b2f380c9b7468a)
 + typer-slim==0.21.1

Give ik_llama.cpp a try. Some discussion and links to GGUF quants here, including a benchmark on an RTX A6000: https://github.com/ikawrakow/ik_llama.cpp/issues/1167#issuecomment-3775037120

I have the same issue

try this:

docker run \
  --privileged \
  --gpus all \
  -it --rm \
  --network host --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-node \
  bash -c "pip install git+https://github.com/huggingface/transformers.git && \
    curl -sSL https://raw.githubusercontent.com/eugr/spark-vllm-docker/main/mods/fix-Salyut1-GLM-4.7-NVFP4/glm4_moe.patch | patch -p1 -d / && \
    vllm serve \
      zai-org/GLM-4.7-Flash \
      --port 8000 --host 0.0.0.0 \
      --gpu-memory-utilization 0.9 \
      --max-model-len 32768 \
      --trust-remote-code"

https://github.com/eugr/spark-vllm-docker

Hello! I noticed your CUDA driver is on version 13, but vLLM is trying to load libcudart from CUDA 12. The driver is backwards compatible, but the CUDA 12 runtime library (libcudart.so.12) still needs to be present. You can keep your main CUDA driver as-is and just pull the older runtime into your environment.

Conda

conda install nvidia/label/cuda-12.4.0::cuda-cudart

Or just Torch

pip install torch --index-url https://download.pytorch.org/whl/cu124
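Another option, if you'd rather not reinstall torch: the CUDA 12 runtime also ships as a standalone wheel, so you can install just libcudart.so.12 into the venv and point the linker at it (a sketch; the path assumes the standard layout of the nvidia-cuda-runtime-cu12 wheel):

```shell
# Install only the CUDA 12 runtime library into the active venv
pip install nvidia-cuda-runtime-cu12

# Locate the wheel's lib directory and prepend it to LD_LIBRARY_PATH
CUDART_DIR="$(python -c 'import os, nvidia.cuda_runtime as m; print(os.path.join(os.path.dirname(m.__file__), "lib"))')"
export LD_LIBRARY_PATH="$CUDART_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

You'd need to export LD_LIBRARY_PATH in every shell that runs `vllm serve` (or add it to the venv's activate script).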

Hope that helps! :)

Also, if you'd like to read more about CUDA driver compatibility, here are the NVIDIA docs:
CUDA compatibility
