Which is the context length of GLM-4.7: 202752 or 128000?

#33
by tuo02 - opened

In the config.json file, max_position_embeddings=202752, while in the tokenizer_config.json file, model_max_length=128000. Which is the context length of GLM-4.7? Also, sglang outputs this message: "Token indices sequence length is longer than the specified maximum sequence length for this model (192619 > 128000). Running this sequence through the model will result in indexing errors"

The short answer is that max_position_embeddings=202752 represents the actual context length.
The model_max_length=128000 in the tokenizer configuration is a soft default that causes your sglang warning. It’s not the actual architectural limit.


Here’s what’s happening:

max_position_embeddings (config.json) defines the maximum number of positions that the rotary positional embeddings (RoPE) can encode. This is the hard architectural ceiling.
The model was trained to handle up to 202,752 tokens. Beyond this, you’d encounter actual indexing errors because there are no learned or interpolated position encodings.
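The failure mode past the ceiling is plain out-of-bounds indexing into the precomputed position table. A toy illustration (this is not the actual RoPE implementation, and the tiny head dimension is purely for demonstration):

```python
import numpy as np

MAX_POSITIONS = 202752  # max_position_embeddings from config.json
head_dim = 2            # toy size; the real model uses a much larger head dimension

# Precomputed rotary angle table, one row per position.
inv_freq = 1.0 / (10000 ** (np.arange(0, head_dim, 2) / head_dim))
angles = np.outer(np.arange(MAX_POSITIONS), inv_freq)

def rope_angles(position):
    """Look up the rotary angles for one token position."""
    return angles[position]

rope_angles(202751)    # fine: the last valid position
# rope_angles(202752)  # IndexError: no position encoding exists past the ceiling
```

Anything below the ceiling resolves to a real row of the table; anything at or above it has nothing to index into, which is exactly the "indexing errors" the warning refers to.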

model_max_length (tokenizer_config.json) is a tokenizer-level hint that frameworks use as a default truncation or warning threshold.
It’s often set conservatively (here, 128K corresponds to the maximum output length) and doesn’t reflect what the model can actually accept as input. This is the value sglang reads when it emits that warning.
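You can confirm the mismatch yourself by reading both files straight from the model directory (the path below is illustrative):

```python
import json
from pathlib import Path

def context_limits(model_dir):
    """Return (max_position_embeddings, model_max_length) from a model directory."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    tok_config = json.loads((Path(model_dir) / "tokenizer_config.json").read_text())
    return config["max_position_embeddings"], tok_config["model_max_length"]

# hard, soft = context_limits("/path/to/glm-4.7")
# `hard` (202752) is the architectural ceiling; `soft` (128000) is only the
# tokenizer-level default that triggers the warning.
```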

To fix the sglang warning, you can override the tokenizer’s limit when launching the model:

# If using sglang's server:
python -m sglang.launch_server \
  --model-path /path/to/glm-4.7 \
  --context-length 202752 \
  ...

Alternatively, if you’re loading the model via the Python API, set context_length=202752 explicitly. The warning is solely due to the mismatch between the tokenizer’s model_max_length and the model’s actual input capacity. As long as you stay under 202,752 tokens, you won’t encounter real indexing errors. It’s safe to ignore the warning or suppress it by overriding the tokenizer’s limit.

It’s worth noting that practical usable context may still be closer to ~200K tokens once you account for system prompt overhead and special tokens. However, the 128K figure is definitely incorrect for input. It’s the output cap.
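The budget arithmetic behind that estimate is simple; the overhead numbers below are assumptions for illustration, not GLM-4.7 specifics:

```python
HARD_LIMIT = 202752          # max_position_embeddings from config.json
system_prompt_tokens = 1500  # assumed system prompt + special-token overhead
reserved_output = 4096       # assumed generation budget

usable_input = HARD_LIMIT - system_prompt_tokens - reserved_output
print(usable_input)  # 197156
```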

Thank you

202752 is the max