VLLM NVFP4 PROBLEM

#41
by prudant - opened

Hi, I'm trying to run the model on vLLM and:

```
d7b5d7db87d592ae1c9e68ce2b17484e995bfd41e95e4f0c4353a8ad45d96b6d
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:325]
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:325] β–ˆ β–ˆ β–ˆβ–„ β–„β–ˆ
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:325] β–„β–„ β–„β–ˆ β–ˆ β–ˆ β–ˆ β–€β–„β–€ β–ˆ version 0.14.0rc2.dev265+g7ef587375
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:325] β–ˆβ–„β–ˆβ–€ β–ˆ β–ˆ β–ˆ β–ˆ model GadflyII/GLM-4.7-Flash-NVFP4
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:325] β–€β–€ β–€β–€β–€β–€β–€ β–€β–€β–€β–€β–€ β–€ β–€
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:325]
(APIServer pid=95) INFO 01-23 07:56:54 [utils.py:261] non-default args: {'api_key': ['123'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'max_log_len': 100000, 'model': 'GadflyII/GLM-4.7-Flash-NVFP4', 'max_model_len': 32000, 'served_model_name': ['gpt-4o'], 'download_dir': '/data', 'gpu_memory_utilization': 0.86, 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'calculate_kv_scales': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 50, 'enable_chunked_prefill': True}
(APIServer pid=95) Traceback (most recent call last):
(APIServer pid=95) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=95) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 990, in <module>
(APIServer pid=95) uvloop.run(run_server(args))
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=95) return __asyncio.run(
(APIServer pid=95) ^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=95) return runner.run(main)
(APIServer pid=95) ^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=95) return self._loop.run_until_complete(task)
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=95) return await main
(APIServer pid=95) ^^^^^^^^^^
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 918, in run_server
(APIServer pid=95) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 937, in run_server_worker
(APIServer pid=95) async with build_async_engine_client(
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=95) return await anext(self.gen)
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 147, in build_async_engine_client
(APIServer pid=95) async with build_async_engine_client_from_engine_args(
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=95) return await anext(self.gen)
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client_from_engine_args
(APIServer pid=95) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1369, in create_engine_config
(APIServer pid=95) model_config = self.create_model_config()
(APIServer pid=95) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1223, in create_model_config
(APIServer pid=95) return ModelConfig(
(APIServer pid=95) ^^^^^^^^^^^^
(APIServer pid=95) File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=95) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=95) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=95) Value error, The checkpoint you are trying to load has model type glm4_moe_lite but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=95)
(APIServer pid=95) You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=95) For further information visit https://errors.pydantic.dev/2.12/v/value_error
```
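The key line is the last validation error: your installed Transformers release doesn't know the `glm4_moe_lite` model type. A quick way to check, before and after upgrading, is to ask Transformers' auto-config registry directly. This is a minimal sketch; `CONFIG_MAPPING` is the `model_type` → config-class registry that `AutoConfig` consults:

```python
# Does the installed Transformers release recognize a given model_type?
# "glm4_moe_lite" is the model type reported in the traceback above.
def knows_model_type(model_type: str) -> bool:
    try:
        # CONFIG_MAPPING maps model_type strings to config classes;
        # a checkpoint whose model_type is missing here raises the
        # "does not recognize this architecture" error seen above.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
    except ImportError:
        return False  # Transformers is not installed in this environment
    return model_type in CONFIG_MAPPING


if __name__ == "__main__":
    print(knows_model_type("glm4_moe_lite"))
```

If this still prints `False` after `pip install --upgrade transformers`, you need Transformers from source, as the error message itself suggests.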

That's the gadflyii fork, so I presume you are running Blackwell. I don't know whether his fork has been merged yet, but GLM-4.7-Flash-NVFP4 seems to run on my Blackwell sm_120 after building the most recent vLLM main from source. I'll assume you are running from the project root, with a venv located there, and that you have compiled the most recent FlashInfer from source. If not, you can still try the patch below, but no guarantees on Blackwell unless you build from the most recent main.

FlashInfer needs the patches below if you are running a recent Python version: `None` is being passed where `False` belongs, and FlashInfer hasn't yet updated its code to accommodate this recent Python change.
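To see what the first patch is working around: calls like `Tensor.to()` and `copy_()` want a real bool for `non_blocking`, so a `None` that leaks through the keyword plumbing has to be coalesced to `False` before the call. A stdlib-only sketch of the pattern the sed lines install (`to_device` is an illustrative stand-in, not a FlashInfer function):

```python
def to_device(non_blocking=None):
    """Illustrative stand-in for a call that demands a real bool,
    like torch.Tensor.to(non_blocking=...)."""
    if not isinstance(non_blocking, bool):
        raise TypeError(
            f"non_blocking must be bool, got {type(non_blocking).__name__}"
        )
    return non_blocking


# Unpatched behavior: a None left by the caller blows up downstream.
try:
    to_device(non_blocking=None)
except TypeError as exc:
    print(f"unpatched: {exc}")

# Patched behavior: coalesce None to False at the call site,
# which is exactly what the sed substitution below inserts.
flag = None
print(to_device(non_blocking=(flag if flag is not None else False)))
```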

Add the following to the top of your build script to fix the `None` type error, plus some additional errors you were likely about to hit. It won't harm your build even if the listed errors aren't present (the patches are idempotent). You can fact-check me with your model once it is running:

```bash
# Assumes FLASHINFER_DECODE_PY and VLLM_FLASHINFER_BACKEND_PY already point
# at the files to patch inside your venv (paths below are illustrative):
#   FLASHINFER_DECODE_PY=".venv/lib/python3.12/site-packages/flashinfer/decode.py"

if [ -f "$FLASHINFER_DECODE_PY" ]; then
    # Patch 1: General .to(non_blocking=None) fix
    grep -q "non_blocking=non_blocking" "$FLASHINFER_DECODE_PY" && \
        sed -i 's/non_blocking=non_blocking/non_blocking=(non_blocking if non_blocking is not None else False)/g' "$FLASHINFER_DECODE_PY"

    # Patch 2: AttributeError: 'int' object has no attribute 'cpu'
    grep -q "seq_lens.cpu()" "$FLASHINFER_DECODE_PY" && \
        sed -i 's/kv_lens_arr_host = seq_lens.cpu()/kv_lens_arr_host = torch.tensor([seq_lens]) if isinstance(seq_lens, int) else seq_lens.cpu()/g' "$FLASHINFER_DECODE_PY"

    # Patch 3: TypeError: copy_() non_blocking logic (New Fix for Line 944)
    grep -q "and non_blocking)" "$FLASHINFER_DECODE_PY" && \
        sed -i 's/non_blocking=(indices.device == self.device) and non_blocking/non_blocking=bool((indices.device == self.device) and non_blocking)/g' "$FLASHINFER_DECODE_PY"
fi

if [ -f "$VLLM_FLASHINFER_BACKEND_PY" ]; then
    # Patch 4: vLLM V1 Scale Assertion Bug
    sed -i 's/^[[:space:]]*assert decode_wrapper._sm_scale == self.scale/    # assert decode_wrapper._sm_scale == self.scale/g' "$VLLM_FLASHINFER_BACKEND_PY"
fi
```
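If you'd rather see what a patch will do before editing files in site-packages, you can dry-run the same substitution in Python. The sample line below is illustrative, not copied verbatim from FlashInfer:

```python
# Dry-run of Patch 1: the same string substitution the sed command performs,
# applied to an illustrative line shaped like the one it targets.
sample = "    out = buf.to(device, non_blocking=non_blocking)"
patched = sample.replace(
    "non_blocking=non_blocking",
    "non_blocking=(non_blocking if non_blocking is not None else False)",
)
print(patched)
# →     out = buf.to(device, non_blocking=(non_blocking if non_blocking is not None else False))
```

Note that the patched text no longer contains the `non_blocking=non_blocking` pattern, so re-running the substitution (or the grep-guarded sed in the script) is a no-op, which is what makes the script safe to run repeatedly.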
