Less Context Length than Expected (600k)

#57
by Forcewithme - opened

Deploying with lmsysorg/sglang:glm5-hopper on 8×H20-3e (141G) GPUs, with the official command:

python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47  \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --served-model-name glm-5-fp8
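
If the server is silently capping the context, it may help to pin the context length explicitly at launch and then confirm what the running server actually reports. Note the `--context-length` flag and the `/get_server_info` endpoint below are assumptions based on recent sglang releases; verify against `--help` inside your image.

```shell
# Assumption: recent sglang versions accept --context-length to override the
# limit derived from the model's config.json; check --help in your image.
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --context-length 200000 \
  --served-model-name glm-5-fp8

# Then check what the running server actually advertises (endpoint name is
# an assumption; recent sglang builds expose /get_server_info):
curl http://localhost:30000/get_server_info
```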

I found that as soon as the prefill tokens reach 600k, the server returns empty content, indicating the context length exceeds the limit. But it shouldn't.

On the same machine with sglang 0.5.9, qwen3.5-397b, kimi-k2.5, and minimax-m2.5 can all reach their maximum context lengths (196k and 256k).


Deploying the service with the official image and launch command, I found it only supports up to a 600k context; once 600k is reached, the content field in the response comes back empty. In my experience this usually indicates insufficient GPU memory. However, we deployed qwen3.5-397b, kimi-k2.5, and minimax-m2.5 on the same hardware and all of them reach their full context lengths; kimi-k2.5 is even a 1T-parameter model.

So I don't understand why locally deployed GLM-5 only supports up to 600k. I'd appreciate an answer from the maintainers or the community.
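
For reference, a back-of-envelope sketch of why long prefills can exhaust GPU memory: the KV cache grows linearly with context length. All model dimensions below are hypothetical placeholders, not GLM-5's actual architecture; substitute the real values from the model's config.json.

```python
# Back-of-envelope KV-cache sizing. All dimensions here are hypothetical
# placeholders, NOT GLM-5's real config -- substitute the values from the
# model's config.json (num_hidden_layers, num_key_value_heads, head_dim).

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Factor of 2 covers the K and V tensors stored at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical example: 80 layers, 8 KV heads, head_dim 128, FP8 (1 byte).
per_token = kv_cache_bytes_per_token(80, 8, 128, 1)  # 163840 bytes per token

# At a 600k-token prefill the cache alone would need ~92 GiB across the
# tensor-parallel group, before weights and activations are counted.
total_gib = per_token * 600_000 / 2**30
print(f"{per_token} B/token, ~{total_gib:.1f} GiB at 600k tokens")
```

Numbers like these are why an empty response at a specific prefill length often points at the KV-cache budget (`--mem-fraction-static`) rather than the model's nominal context limit.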

I also found a new image on Docker Hub: docker pull lmsysorg/sglang:glm5-hopper-patched. What is this image used for?

GLM-5 only supports a 200K context.

ZHANGYUXUAN-zR changed discussion status to closed

GLM-5 only supports a 200K context.

It's a typo: it only supports 60k, not 600k.


Typo on my part; in my tests I found it only supports up to 60k.
