EAGLE3 Draft Head — GLM-4.7-FP8
A lightweight EAGLE3 draft head for GLM-4.7-FP8 (~218B MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with SpecForge on 8x H200 GPUs using the EAGLE-3 training-time test objective.
GLM-4.7 uses sigmoid top-8 routing — activating 8 out of 160 experts per token rather than the typical 1-2 in most MoE models. This preserves high representational capacity at the cost of increased compute, making speculative decoding especially valuable: the draft head is tiny relative to the 218B target.
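The routing step can be sketched as follows — a minimal NumPy illustration of sigmoid top-k expert selection with the expert count and k from above (the real model's gating details, e.g. bias terms and normalization, may differ):

```python
import numpy as np

def sigmoid_topk_route(router_logits, k=8):
    """Select top-k experts by sigmoid score and renormalize their weights.

    router_logits: (num_tokens, num_experts) raw gating scores.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    scores = 1.0 / (1.0 + np.exp(-router_logits))   # sigmoid, not softmax
    topk = np.argsort(-scores, axis=-1)[:, :k]      # indices of the k largest scores
    weights = np.take_along_axis(scores, topk, axis=-1)
    weights /= weights.sum(axis=-1, keepdims=True)  # renormalize over chosen experts
    return topk, weights

# One batch of 4 tokens routed over 160 experts, 8 active per token.
logits = np.random.default_rng(0).normal(size=(4, 160))
idx, w = sigmoid_topk_route(logits, k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8)
```

Because each score is gated independently through a sigmoid rather than a softmax, expert scores do not compete for probability mass before the top-k cut.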
Blog post: 1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7
Usage
SGLang (GPU)
Requires our SGLang fork for GLM-4.7 Eagle3 support.
B=1 server (wide tree — optimal for single-user, real-time requests):
```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```
B=32 server (wide tree remains the recommended configuration at high batch sizes for this model):
```bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```
Note: Unlike other MoE models where narrow tree helps at B=32, GLM-4.7-FP8 performs marginally better with wide tree (1.16x vs 1.14x). Use wide tree for all workloads.
Python Client
```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
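To see the speedup end to end, you can time a request and divide completion tokens by wall-clock time. A small sketch (the `usage` field follows the OpenAI-compatible response schema; the live request assumes the server from above is running on port 30000, and degrades gracefully if it is not):

```python
import time
import requests

def tokens_per_second(usage: dict, elapsed_s: float) -> float:
    """Decode throughput from an OpenAI-style `usage` dict and wall-clock time."""
    return usage["completion_tokens"] / elapsed_s

if __name__ == "__main__":
    try:
        start = time.time()
        resp = requests.post(
            "http://localhost:30000/v1/chat/completions",
            json={
                "model": "default",
                "messages": [{"role": "user", "content": "Summarize speculative decoding in 3 sentences."}],
                "max_tokens": 256,
                "temperature": 0,
            },
        )
        elapsed = time.time() - start
        print(f"{tokens_per_second(resp.json()['usage'], elapsed):.1f} tok/s")
    except requests.exceptions.RequestException:
        print("server not reachable; start it as shown in the Usage section")
```

Comparing this number with and without the `--speculative-*` flags reproduces the baseline-vs-EAGLE3 comparison in the benchmark tables below.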
Training Details
| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=8, DP=1) |
| Pre-training | 6 epochs on 54K mixed samples (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 1024 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |
Training Method
EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 2, 46, 89 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
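Shape-wise, the data flow through the draft head is simple: fuse the three auxiliary hidden states with a linear projection, run one decoder layer, and predict draft logits. A NumPy sketch with scaled-down dimensions (the fusion projection and the dense "decoder layer" are simplified stand-ins; the real head uses hidden size 5120 and draft vocabulary 32000, per the Model Architecture table):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, draft_vocab, seq = 64, 320, 4  # scaled down from 5120 / 32000 for illustration

# Hidden states captured from three target layers (early, middle, late).
aux = [rng.normal(size=(seq, hidden)).astype(np.float32) for _ in range(3)]

# Fusion: concatenate the three streams and project back to hidden size.
W_fuse = rng.normal(scale=0.05, size=(3 * hidden, hidden)).astype(np.float32)
fused = np.concatenate(aux, axis=-1) @ W_fuse          # (seq, hidden)

# The single decoder layer, abstracted as one dense transform plus residual.
W_layer = rng.normal(scale=0.05, size=(hidden, hidden)).astype(np.float32)
h = fused + np.tanh(fused @ W_layer)                   # (seq, hidden)

# Draft LM head over the reduced draft vocabulary.
W_head = rng.normal(scale=0.05, size=(hidden, draft_vocab)).astype(np.float32)
logits = h @ W_head                                    # (seq, draft_vocab)
print(logits.shape)  # (4, 320)
```

The TTT loss is then computed on these logits, unrolled over multiple draft steps so the head is trained under the same conditions it faces at inference.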
Regenerated Data
The final fine-tuning stage uses training data where the assistant responses were generated by GLM-4.7 itself (at temp=0.8), rather than using generic ShareGPT/UltraChat responses. This aligns the draft model's predicted distribution with the target model's actual output, improving acceptance rates — especially at high batch sizes (B=32) where every accepted token matters more.
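A sketch of that regeneration step, assuming the target model is served behind the OpenAI-compatible endpoint shown in the Usage section (prompt extraction, batching, and error handling are elided; `regenerate_conversation` is an illustrative helper, not part of SpecForge):

```python
import requests

def regenerate_conversation(messages,
                            endpoint="http://localhost:30000/v1/chat/completions"):
    """Replace each assistant turn with a fresh target-model completion at temp=0.8."""
    out = []
    for msg in messages:
        if msg["role"] != "assistant":
            out.append(msg)
            continue
        resp = requests.post(endpoint, json={
            "model": "default",
            "messages": out,        # context so far: every turn before this one
            "max_tokens": 1024,
            "temperature": 0.8,     # matches the fine-tuning data recipe above
        })
        out.append({"role": "assistant",
                    "content": resp.json()["choices"][0]["message"]["content"]})
    return out
```

Only the assistant turns change; user turns are kept verbatim so the conversation structure of the source datasets is preserved.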
Performance
B=1 Inference Benchmarks (temp=0, FP8, TP=8)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| Terminal-Bench | 55.0 | 113.6 | 2.07x | 42.5% | 2.55 |
| MT-Bench | 66.5 | 106.7 | 1.60x | 42.5% | 2.55 |
| SWEBench-Verified | 66.1 | 104.0 | 1.57x | 45.0% | 2.70 |
| HumanEval | 66.8 | 102.2 | 1.53x | 54.2% | 3.25 |
| Mean | 63.6 | 106.6 | 1.69x | 46.1% | 2.76 |
B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 922.7 | 1,108.4 | 1.20x |
| MT-Bench | 954.2 | 1,109.7 | 1.16x |
| Terminal-Bench | 952.3 | 1,104.3 | 1.16x |
| HumanEval | 915.1 | 1,035.9 | 1.13x |
| Mean | 936.1 | 1,089.6 | 1.16x |
Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit 63291f7f51.
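The Speedup column is the per-dataset throughput ratio, and the Mean row averages those ratios (rather than dividing mean throughputs). A quick arithmetic check against both tables:

```python
b1 = {  # dataset: (baseline tok/s, eagle3 tok/s), from the B=1 table
    "Terminal-Bench":    (55.0, 113.6),
    "MT-Bench":          (66.5, 106.7),
    "SWEBench-Verified": (66.1, 104.0),
    "HumanEval":         (66.8, 102.2),
}
b32 = {  # from the B=32 table
    "SWEBench-Verified": (922.7, 1108.4),
    "MT-Bench":          (954.2, 1109.7),
    "Terminal-Bench":    (952.3, 1104.3),
    "HumanEval":         (915.1, 1035.9),
}

def mean_speedup(table):
    """Average of per-dataset EAGLE3/baseline throughput ratios."""
    ratios = [eagle / base for base, eagle in table.values()]
    return sum(ratios) / len(ratios)

print(f"B=1 mean speedup:  {mean_speedup(b1):.2f}x")   # 1.69x
print(f"B=32 mean speedup: {mean_speedup(b32):.2f}x")  # 1.16x
```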
Model Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 5120 |
| Num hidden layers | 1 |
| Num attention heads | 40 (8 KV heads) |
| head_dim | 128 |
| Intermediate size | 16384 |
| Auxiliary layers | [2, 46, 89] |
| Vocab size | 151552 (target) / 32000 (draft) |
| Checkpoint size | ~1.2 GB |
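A back-of-envelope parameter count from the table above (this assumes a standard GQA attention block, a SwiGLU MLP, and a 3x-hidden fusion projection for the auxiliary layers; embeddings and draft-to-target token mappings are excluded, so it is a lower bound consistent with the ~1.2 GB checkpoint):

```python
hidden, heads, kv_heads, head_dim = 5120, 40, 8, 128
inter, draft_vocab = 16384, 32000

attn = hidden * heads * head_dim           # Q projection
attn += 2 * hidden * kv_heads * head_dim   # K and V (GQA: 8 KV heads)
attn += heads * head_dim * hidden          # output projection

mlp = 3 * hidden * inter                   # SwiGLU: gate, up, down projections
fuse = (3 * hidden) * hidden               # fuse the 3 auxiliary hidden states
head = hidden * draft_vocab                # LM head over the draft vocabulary

total = attn + mlp + fuse + head
print(f"{total / 1e6:.0f}M params, ~{total * 2 / 1e9:.2f} GB in bfloat16")
```

Even this generous accounting comes to well under 1% of the 218B target, which is why drafting adds so little overhead per verified token.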
Limitations
- TP=8 required. FP8 block constraint: shared_expert intermediate_size=512, and 512/8=64 is not divisible by block_n=128. TP=4 fails at this boundary.
- Temperature sensitivity. Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. Deploy at temp=0 for coding and factual workloads.
- FP8 quantization. The target model runs in FP8. The draft head itself is bfloat16 but depends on the target's FP8 hidden states during inference.
- Requires SGLang fork. Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
- JIT deep_gemm incompatible. Training requires `SGLANG_ENABLE_JIT_DEEPGEMM=0` to avoid kernel assertion failures.
License
This draft head is released under the MIT License, matching the GLM-4.7-FP8 license.
Citation
```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```