Instructions to use zai-org/GLM-4.7 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-4.7 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use zai-org/GLM-4.7 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.7"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
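
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style response shape. The helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed defaults that mirror the curl example (port 8000, same model id).
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "zai-org/GLM-4.7"


def build_chat_payload(prompt, model=MODEL):
    """Build the same JSON body that the curl example sends."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(prompt, url=BASE_URL, timeout=60):
    """POST the payload to the server and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    # OpenAI-compatible responses carry the reply in choices[0].message.content.
    return data["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat("What is the capital of France?"))
```

For the SGLang server the only change should be the URL, for example `chat("...", url="http://localhost:30000/v1/chat/completions")`.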

Use Docker

docker model run hf.co/zai-org/GLM-4.7

SGLang

How to use zai-org/GLM-4.7 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.7" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.7 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.7
```

Does this model support MLA or only the flash version does?

#41

by Aly87 - opened Feb 10

Discussion

Aly87

Feb 10

I can't seem to find any info on this.

ZHANGYUXUAN-zR

Z.ai org Feb 10

Yes, only the Flash version uses MLA; 4.7 still uses GQA.
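
For readers wondering why this matters for local use: with GQA the KV cache stores one key and one value vector per KV head, per layer, per token, whereas MLA caches a compressed latent per token and so tends to need less memory. A rough back-of-the-envelope sketch of the GQA case follows. The layer count, KV head count, and head dimension are hypothetical placeholders, not GLM-4.7's actual configuration; substitute the values from the model's `config.json`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate GQA KV cache size in bytes (keys plus values, 16-bit by default)."""
    # Factor of 2 covers one key vector and one value vector per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical placeholder values, NOT taken from GLM-4.7's config.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, context_tokens=32_768)
print(f"{size / 2**30:.1f} GiB")  # prints: 10.0 GiB
```

The estimate scales linearly with context length, so halving the context window halves the cache.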

Aly87

Feb 10

Thank you for the reply. That's a shame; I was hoping it would use MLA, as I want to run it locally on a Mac.
Is there a way to cut back on the thinking tokens? I love the quality, but the largest chat window I can use without slowing down too much gets eaten up by thinking blocks.
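
One client-side mitigation, sketched here under assumptions, is to drop the reasoning from earlier assistant turns before resending the history. This assumes the model wraps its reasoning in `<think>...</think>` markers; the chat template or frontend may already do this for you, so check before relying on it. It does not reduce the tokens generated while thinking, it only stops old thinking blocks from piling up in the context. `strip_thinking` is an illustrative helper, not a library function.

```python
import re

# Assumption: reasoning is emitted between <think> and </think> markers.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(messages):
    """Return a copy of the chat history with reasoning removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            content = THINK_BLOCK.sub("", message["content"]).strip()
            # Build a new dict so the caller's history is left untouched.
            message = {**message, "content": content}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "<think>The user greeted me.</think>\nHello! How can I help?"},
]
print(strip_thinking(history)[1]["content"])  # prints: Hello! How can I help?
```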

Sephyi

Feb 10

What Mac do you have? I’ve run ~70B parameter models on my M4 Max 16-inch. Technically it worked, just not in the way my hopes and dreams envisioned. Honestly, GPU spot instances have been the move. You can snag a B300 for around $1.45/hr depending on demand. Sure, spot instances can get yanked, but in practice it rarely happens, and the cost savings more than make up for the occasional eviction lottery.

As always, it depends on your use case. If you just want to do some 🤏🤏 slow testing, sure, you can get it running on a Mac. But if you want to actually work, give it some thought, organize a bit, then spin up something with real power under maximum cheap-ass circumstances and make those instances burn. However, as a responsible sysadmin, I have to tell you: you need to secure the instance yourself. Because under most legal frameworks, you’re the one whose neck is on the line. 🪓

Aly87

Feb 10

edited Feb 10

M3 Ultra with 512 GB RAM. Honestly, probably not the usual use case around here, but I just want a stateful buddy for chitchat. I'm planning to use Letta to set it up, so all I need is back-and-forth conversation (no ginormous prompts or anything) and a decent-sized context window, I guess. I'd like to have it up and running 24/7 if possible, which is why I'm trying to do it locally instead of on spot instances. But I'm pretty new to the whole local LLM thing.
