Instructions to use zai-org/GLM-4.7 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use zai-org/GLM-4.7 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use zai-org/GLM-4.7 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "zai-org/GLM-4.7"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "zai-org/GLM-4.7",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/zai-org/GLM-4.7
```
- SGLang
How to use zai-org/GLM-4.7 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "zai-org/GLM-4.7" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "zai-org/GLM-4.7",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "zai-org/GLM-4.7",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use zai-org/GLM-4.7 with Docker Model Runner:
```shell
docker model run hf.co/zai-org/GLM-4.7
```
Does this model support MLA, or does only the Flash version?
I can't seem to find any info.
Yes, only the Flash version uses MLA; GLM-4.7 still uses GQA.
Thank you for the reply. That's a shame; I was hoping it would use MLA, as I want to run it locally on a Mac.
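For sizing a local setup, the attention variant matters mainly through the KV cache: with GQA, every token stores one key vector and one value vector per KV head in each layer, while MLA caches a smaller compressed latent instead. Here is a rough sketch of the GQA arithmetic. The layer count, KV-head count, and head dimension are placeholder values, not numbers taken from GLM-4.7, so swap in the ones from the model's config.json.

```python
def gqa_kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence under GQA."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


# Placeholder architecture values (hypothetical, NOT GLM-4.7's real config).
# bytes_per_elem=2 assumes a 16-bit cache (fp16/bf16).
cache = gqa_kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{cache / 2**30:.1f} GiB")  # 10.0 GiB for these placeholder values
```

The estimate covers the cache only; model weights and activations come on top of it, and a quantized cache would shrink `bytes_per_elem`.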
Is there a way to cut back on the thinking tokens? I love the quality, but the max chat window size I'm able to use without slowing down too much gets eaten up by thinking blocks.
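One thing that may be worth trying when the model is served through vLLM or SGLang as shown above: both servers accept a `chat_template_kwargs` field in the chat request, and earlier GLM releases document an `enable_thinking` flag there. Whether GLM-4.7's chat template honors the same flag is an assumption here, so check the model card. A minimal sketch, where `build_chat_request` and `send` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=False, max_tokens=512):
    """Build an OpenAI-style chat payload for a local vLLM or SGLang server."""
    return {
        "model": "zai-org/GLM-4.7",
        "messages": [{"role": "user", "content": prompt}],
        # Hard cap on generated tokens (thinking plus answer).
        "max_tokens": max_tokens,
        # Assumption: the chat template reads an `enable_thinking` flag,
        # as documented for earlier GLM releases.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to the server and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request("What is the capital of France?")
print(json.dumps(payload, indent=2))
```

With a server running, `send(payload)` would submit the request; for SGLang, point `url` at port 30000 instead. Turning thinking off will likely cost some answer quality, so it may be worth comparing both settings on your own prompts.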
What Mac do you have? I’ve run ~70B parameter models on my M4 Max 16-inch — technically it worked, just not in the way my hopes and dreams envisioned. Honestly, GPU spot instances have been the move. You can snag a B300 for around $1.45/hr depending on demand. Sure, spot instances can get yanked, but in practice it rarely happens, and the cost savings more than make up for the occasional eviction lottery.
As always, it depends on your use case. If you just want to do some 🤏🤏 slow testing, sure, you can get it running on a Mac. But if you want to actually work, give it some thought, organize a bit, then spin up something with real power under maximum cheap-ass circumstances and make those instances burn. However, as a responsible sysadmin, I have to tell you: you need to secure the instance yourself. Because under most legal frameworks, you’re the one whose neck is on the line. 🪓
M3 Ultra with 512 GB RAM. Honestly, it's probably not the usual use case around here, but I just want a stateful buddy for chitchat. I'm planning to use Letta to set it up, so all I need is back-and-forth conversation (no ginormous prompts or anything) and a decent-sized context window, I guess. I'd like to have it up and running 24/7 if possible, which is why I'm trying to do it locally instead of on spot instances. But I'm pretty new to the whole local LLM thing.