GLM-5-FP8 / README.md

zRzRzRzRzRzRzR

update

949e09d 1 day ago

9.56 kB

	---
	language:
	- en
	- zh
	library_name: transformers
	license: mit
	pipeline_tag: text-generation
	---

	# GLM-5-FP8

	<div align="center">
	<img src=https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/logo.svg width="15%"/>
	</div>
	<p align="center">
	👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/wechat.png" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
	<br>
	📖 Check out the GLM-5 <a href="https://z.ai/blog/glm-5" target="_blank">technical blog</a>.
	<br>
	📍 Use GLM-5 API services on <a href="https://docs.z.ai/guides/llm/glm-5">Z.ai API Platform. </a>
	<br>
	👉 One click to <a href="https://chat.z.ai">GLM-5</a>.
	</p>

	## Introduction

	We are launching GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

	Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed [slime](https://github.com/THUDM/slime), a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

	## Benchmark

	\| \| GLM-5 \| GLM-4.7 \| DeepSeek-V3.2 \| Kimi K2.5 \| Claude Opus 4.5 \| Gemini 3 Pro \| GPT-5.2 (xhigh) \|
	\| -------------------------------- \| ---------------------- \| --------- \| ------------- \|-----------\| --------------- \| ------------ \| --------------- \|
	\| HLE \| 30.5 \| 24.8 \| 25.1 \| 31.5 \| 28.4 \| 37.2 \| 35.4 \|
	\| HLE (w/ Tools) \| 50.4 \| 42.8 \| 40.8 \| 51.8 \| 43.4* \| 45.8* \| 45.5* \|
	\| AIME 2026 I \| 92.7 \| 92.9 \| 92.7 \| 92.5 \| 93.3 \| 90.6 \| - \|
	\| HMMT Nov. 2025 \| 96.9 \| 93.5 \| 90.2 \| 91.1 \| 91.7 \| 93.0 \| 97.1 \|
	\| IMOAnswerBench \| 82.5 \| 82.0 \| 78.3 \| 81.8 \| 78.5 \| 83.3 \| 86.3 \|
	\| GPQA-Diamond \| 86.0 \| 85.7 \| 82.4 \| 87.6 \| 87.0 \| 91.9 \| 92.4 \|
	\| SWE-bench Verified \| 77.8 \| 73.8 \| 73.1 \| 76.8 \| 80.9 \| 76.2 \| 80.0 \|
	\| SWE-bench Multilingual \| 73.3 \| 66.7 \| 70.2 \| 73.0 \| 77.5 \| 65.0 \| 72.0 \|
	\| Terminal-Bench 2.0 (Terminus 2) \| 56.2 / 60.7 † \| 41.0 \| 39.3 \| 50.8 \| 59.3 \| 54.2 \| 54.0 \|
	\| Terminal-Bench 2.0 (Claude Code) \| 56.2 / 61.1 † \| 32.8 \| 46.4 \| - \| 57.9 \| - \| - \|
	\| CyberGym \| 43.2 \| 23.5 \| 17.3 \| 41.3 \| 50.6 \| 39.9 \| - \|
	\| BrowseComp \| 62.0 \| 52.0 \| 51.4 \| 60.6 \| 37.0 \| 37.8 \| - \|
	\| BrowseComp (w/ Context Manage) \| 75.9 \| 67.5 \| 67.6 \| 74.9 \| 67.8 \| 59.2 \| 65.8 \|
	\| BrowseComp-Zh \| 72.7 \| 66.6 \| 65.0 \| 62.3 \| 62.4 \| 66.8 \| 76.1 \|
	\| τ²-Bench \| 89.7 \| 87.4 \| 85.3 \| 80.2 \| 91.6 \| 90.7 \| 85.5 \|
	\| MCP-Atlas (Public Set) \| 67.8 \| 52.0 \| 62.2 \| 63.8 \| 65.2 \| 66.6 \| 68.0 \|
	\| Tool-Decathlon \| 38.0 \| 23.8 \| 35.2 \| 27.8 \| 43.5 \| 36.4 \| 46.3 \|
	\| Vending Bench 2 \| $4,432.12 \| $2,376.82 \| $1,034.00 \| $1,198.46 \| $4,967.06 \| $5,478.16 \| $3,591.33 \|

	> *: refers to their scores of full set.
	>
	> †: A verified version of Terminal-Bench 2.0 that fixes some ambiguous instructions.
	See footnote for more evaluation details.

	### Footnote

	* Humanity’s Last Exam (HLE) & other reasoning tasks: We evaluate with a maximum generation length of 131,072 tokens (`temperature=1.0, top_p=0.95, max_new_tokens=131072`). By default, we report the text-only subset; results marked with * are from the full set. We use GPT-5.2 (medium) as the judge model. For HLE-with-tools, we use a maximum context length of 202,752 tokens.
	* SWE-bench & SWE-bench Multilingual: We run the SWE-bench suite with OpenHands using a tailored instruction prompt. Settings: `temperature=0.7, top_p=0.95, max_new_tokens=16384`, with a 200K context window.
	* BrowserComp: Without context management, we retain details from the most recent 5 turns. With context management, we use the same discard-all strategy as DeepSeek-v3.2 and Kimi K2.5.
	* Terminal-Bench 2.0 (Terminus 2): We evaluate with the Terminus framework using `timeout=2h, temperature=0.7, top_p=1.0, max_new_tokens=8192`, with a 128K context window. Resource limits are capped at 16 CPUs and 32 GB RAM.
	* Terminal-Bench 2.0 (Claude Code): We evaluate in Claude Code 2.1.14 (think mode, default effort) with `temperature=1.0, top_p=0.95, max_new_tokens=65536`. We remove wall-clock time limits due to generation speed, while preserving per-task CPU and memory constraints. Scores are averaged over 5 runs. We fix environment issues introduced by Claude Code and also report results on a verified Terminal-Bench 2.0 dataset that resolves ambiguous instructions (see: [https://huggingface.co/datasets/zai-org/terminal-bench-2-verified](https://huggingface.co/datasets/zai-org/terminal-bench-2-verified)).
	* CyberGym: We evaluate in Claude Code 2.1.18 (think mode, no web tools) with (`temperature=1.0, top_p=1.0, max_new_tokens=32000`) and a 250-minute timeout per task. Results are single-run Pass@1 over 1,507 tasks.
	* MCP-Atlas: All models are evaluated in think mode on the 500-task public subset with a 10-minute timeout per task. We use Gemini 3 Pro as the judge model.
	* τ²-bench: We add a small prompt adjustment in Retail and Telecom to avoid failures caused by premature user termination. For Airline, we apply the domain fixes proposed in the Claude Opus 4.5 system card.
	* Vending Bench 2: Runs are conducted independently by [Andon Labs](https://andonlabs.com/evals/vending-bench-2).


	## Serve GLM-5 Locally

	### Prepare environment

	vLLM, SGLang, and xLLM all support local deployment of GLM-5. A simple deployment guide is provided here.

	+ vLLM

	Using Docker as:

	```shell
	docker pull vllm/vllm-openai:nightly
	```

	or using pip:

	```shell
	pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
	```

	then upgrade transformers:

	```
	pip install git+https://github.com/huggingface/transformers.git
	```

	+ SGLang

	Using Docker as:
	```bash
	docker pull lmsysorg/sglang:glm5-hopper # For Hopper GPU
	docker pull lmsysorg/sglang:glm5-blackwell # For Blackwell GPU
	```

	### Deploy

	+ vLLM

	```shell
	vllm serve zai-org/GLM-5-FP8 \
	--tensor-parallel-size 8 \
	--gpu-memory-utilization 0.85 \
	--speculative-config.method mtp \
	--speculative-config.num_speculative_tokens 1 \
	--tool-call-parser glm47 \
	--reasoning-parser glm45 \
	--enable-auto-tool-choice \
	--served-model-name glm-5-fp8
	```

	Check the [recipes](https://github.com/vllm-project/recipes/blob/main/GLM/GLM5.md) for more details.

	+ SGLang

	```shell
	python3 -m sglang.launch_server \
	--model-path zai-org/GLM-5-FP8 \
	--tp-size 8 \
	--tool-call-parser glm47 \
	--reasoning-parser glm45 \
	--speculative-algorithm EAGLE \
	--speculative-num-steps 3 \
	--speculative-eagle-topk 1 \
	--speculative-num-draft-tokens 4 \
	--mem-fraction-static 0.85 \
	--served-model-name glm-5-fp8
	```

	Check the [sglang cookbook](https://cookbook.sglang.io/autoregressive/GLM/GLM-5) for more details.

	+ xLLM and other Ascend NPU

	Please check the deployment guide [here](https://github.com/zai-org/GLM-5/blob/main/example/ascend.md).


	## Citation

	Our technical report is coming soon.