EAGLE3 Draft Head — GLM-4.7-FP8
A lightweight EAGLE3 draft head for GLM-4.7-FP8 (~218B MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with SpecForge on 8x H200 GPUs using the EAGLE-3 training-time test objective.
GLM-4.7 uses sigmoid top-8 routing — activating 8 out of 160 experts per token rather than the typical 1-2 in most MoE models. This preserves high representational capacity at the cost of increased compute, making speculative decoding especially valuable: the draft head is tiny relative to the 218B target.
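The routing step can be sketched as follows — a minimal NumPy illustration of sigmoid top-k expert selection with the expert count and k from above (the real model's gating details, e.g. bias terms and normalization, may differ):

```python
import numpy as np

def sigmoid_topk_route(router_logits, k=8):
    """Select top-k experts by sigmoid score and renormalize their weights.

    router_logits: (num_tokens, num_experts) raw gating scores.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    scores = 1.0 / (1.0 + np.exp(-router_logits))   # sigmoid, not softmax
    topk = np.argsort(-scores, axis=-1)[:, :k]      # indices of the k largest scores
    weights = np.take_along_axis(scores, topk, axis=-1)
    weights /= weights.sum(axis=-1, keepdims=True)  # renormalize over chosen experts
    return topk, weights

# One batch of 4 tokens routed over 160 experts, 8 active per token.
logits = np.random.default_rng(0).normal(size=(4, 160))
idx, w = sigmoid_topk_route(logits, k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8)
```

Because each score is gated independently through a sigmoid rather than a softmax, expert scores do not compete for probability mass before the top-k cut.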
Blog post: 1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7
Usage
SGLang (GPU)
Requires our SGLang fork for GLM-4.7 Eagle3 support.
B=1 server (wide tree — optimal for single-user, real-time requests):
```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```
B=32 server (wide tree remains the recommended configuration at high batch sizes for this model):
```bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```
Note: Unlike other MoE models where narrow tree helps at B=32, GLM-4.7-FP8 performs marginally better with wide tree (1.16x vs 1.14x). Use wide tree for all workloads.
Python Client
```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
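To see the speedup end to end, you can time a request and divide completion tokens by wall-clock time. A small sketch (the `usage` field follows the OpenAI-compatible response schema; the live request assumes the server from above is running on port 30000, and degrades gracefully if it is not):

```python
import time
import requests

def tokens_per_second(usage: dict, elapsed_s: float) -> float:
    """Decode throughput from an OpenAI-style `usage` dict and wall-clock time."""
    return usage["completion_tokens"] / elapsed_s

if __name__ == "__main__":
    try:
        start = time.time()
        resp = requests.post(
            "http://localhost:30000/v1/chat/completions",
            json={
                "model": "default",
                "messages": [{"role": "user", "content": "Summarize speculative decoding in 3 sentences."}],
                "max_tokens": 256,
                "temperature": 0,
            },
        )
        elapsed = time.time() - start
        print(f"{tokens_per_second(resp.json()['usage'], elapsed):.1f} tok/s")
    except requests.exceptions.RequestException:
        print("server not reachable; start it as shown in the Usage section")
```

Comparing this number with and without the `--speculative-*` flags reproduces the baseline-vs-EAGLE3 comparison in the benchmark tables below.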
Training Details
| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=8, DP=1) |
| Pre-training | 6 epochs on 54K mixed samples (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 1024 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |
Training Method
EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 2, 46, 89 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
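Shape-wise, the data flow through the draft head is simple: fuse the three auxiliary hidden states with a linear projection, run one decoder layer, and predict draft logits. A NumPy sketch with scaled-down dimensions (the fusion projection and the dense "decoder layer" are simplified stand-ins; the real head uses hidden size 5120 and draft vocabulary 32000, per the Model Architecture table):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, draft_vocab, seq = 64, 320, 4  # scaled down from 5120 / 32000 for illustration

# Hidden states captured from three target layers (early, middle, late).
aux = [rng.normal(size=(seq, hidden)).astype(np.float32) for _ in range(3)]

# Fusion: concatenate the three streams and project back to hidden size.
W_fuse = rng.normal(scale=0.05, size=(3 * hidden, hidden)).astype(np.float32)
fused = np.concatenate(aux, axis=-1) @ W_fuse          # (seq, hidden)

# The single decoder layer, abstracted as one dense transform plus residual.
W_layer = rng.normal(scale=0.05, size=(hidden, hidden)).astype(np.float32)
h = fused + np.tanh(fused @ W_layer)                   # (seq, hidden)

# Draft LM head over the reduced draft vocabulary.
W_head = rng.normal(scale=0.05, size=(hidden, draft_vocab)).astype(np.float32)
logits = h @ W_head                                    # (seq, draft_vocab)
print(logits.shape)  # (4, 320)
```

The TTT loss is then computed on these logits, unrolled over multiple draft steps so the head is trained under the same conditions it faces at inference.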
Regenerated Data
The final fine-tuning stage uses training data where the assistant responses were generated by GLM-4.7 itself (at temp=0.8), rather than using generic ShareGPT/UltraChat responses. This aligns the draft model's predicted distribution with the target model's actual output, improving acceptance rates — especially at high batch sizes (B=32) where every accepted token matters more.
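A sketch of that regeneration step, assuming the target model is served behind the OpenAI-compatible endpoint shown in the Usage section (prompt extraction, batching, and error handling are elided; `regenerate_conversation` is an illustrative helper, not part of SpecForge):

```python
import requests

def regenerate_conversation(messages,
                            endpoint="http://localhost:30000/v1/chat/completions"):
    """Replace each assistant turn with a fresh target-model completion at temp=0.8."""
    out = []
    for msg in messages:
        if msg["role"] != "assistant":
            out.append(msg)
            continue
        resp = requests.post(endpoint, json={
            "model": "default",
            "messages": out,        # context so far: every turn before this one
            "max_tokens": 1024,
            "temperature": 0.8,     # matches the fine-tuning data recipe above
        })
        out.append({"role": "assistant",
                    "content": resp.json()["choices"][0]["message"]["content"]})
    return out
```

Only the assistant turns change; user turns are kept verbatim so the conversation structure of the source datasets is preserved.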
Performance
B=1 Inference Benchmarks (temp=0, FP8, TP=8)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| Terminal-Bench | 55.0 | 113.6 | 2.07x | 42.5% | 2.55 |
| MT-Bench | 66.5 | 106.7 | 1.60x | 42.5% | 2.55 |
| SWEBench-Verified | 66.1 | 104.0 | 1.57x | 45.0% | 2.70 |
| HumanEval | 66.8 | 102.2 | 1.53x | 54.2% | 3.25 |
| Mean | 63.6 | 106.6 | 1.69x | 46.1% | 2.76 |
B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 922.7 | 1,108.4 | 1.20x |
| MT-Bench | 954.2 | 1,109.7 | 1.16x |
| Terminal-Bench | 952.3 | 1,104.3 | 1.16x |
| HumanEval | 915.1 | 1,035.9 | 1.13x |
| Mean | 936.1 | 1,089.6 | 1.16x |
Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit 63291f7f51.
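The Speedup column is the per-dataset throughput ratio, and the Mean row averages those ratios (rather than dividing mean throughputs). A quick arithmetic check against both tables:

```python
b1 = {  # dataset: (baseline tok/s, eagle3 tok/s), from the B=1 table
    "Terminal-Bench":    (55.0, 113.6),
    "MT-Bench":          (66.5, 106.7),
    "SWEBench-Verified": (66.1, 104.0),
    "HumanEval":         (66.8, 102.2),
}
b32 = {  # from the B=32 table
    "SWEBench-Verified": (922.7, 1108.4),
    "MT-Bench":          (954.2, 1109.7),
    "Terminal-Bench":    (952.3, 1104.3),
    "HumanEval":         (915.1, 1035.9),
}

def mean_speedup(table):
    """Average of per-dataset EAGLE3/baseline throughput ratios."""
    ratios = [eagle / base for base, eagle in table.values()]
    return sum(ratios) / len(ratios)

print(f"B=1 mean speedup:  {mean_speedup(b1):.2f}x")   # 1.69x
print(f"B=32 mean speedup: {mean_speedup(b32):.2f}x")  # 1.16x
```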
Model Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 5120 |
| Num hidden layers | 1 |
| Num attention heads | 40 (8 KV heads) |
| head_dim | 128 |
| Intermediate size | 16384 |
| Auxiliary layers | [2, 46, 89] |
| Vocab size | 151552 (target) / 32000 (draft) |
| Checkpoint size | ~1.2 GB |
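A back-of-envelope parameter count from the table above (this assumes a standard GQA attention block, a SwiGLU MLP, and a 3x-hidden fusion projection for the auxiliary layers; embeddings and draft-to-target token mappings are excluded, so it is a lower bound consistent with the ~1.2 GB checkpoint):

```python
hidden, heads, kv_heads, head_dim = 5120, 40, 8, 128
inter, draft_vocab = 16384, 32000

attn = hidden * heads * head_dim           # Q projection
attn += 2 * hidden * kv_heads * head_dim   # K and V (GQA: 8 KV heads)
attn += heads * head_dim * hidden          # output projection

mlp = 3 * hidden * inter                   # SwiGLU: gate, up, down projections
fuse = (3 * hidden) * hidden               # fuse the 3 auxiliary hidden states
head = hidden * draft_vocab                # LM head over the draft vocabulary

total = attn + mlp + fuse + head
print(f"{total / 1e6:.0f}M params, ~{total * 2 / 1e9:.2f} GB in bfloat16")
```

Even this generous accounting comes to well under 1% of the 218B target, which is why drafting adds so little overhead per verified token.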
Limitations
- TP=8 required. FP8 block constraint: shared_expert intermediate_size=512, and 512/8=64 is not divisible by block_n=128. TP=4 fails at this boundary.
- Temperature sensitivity. Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. Deploy at temp=0 for coding and factual workloads.
- FP8 quantization. The target model runs in FP8. The draft head itself is bfloat16 but depends on the target's FP8 hidden states during inference.
- Requires SGLang fork. Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
- JIT deep_gemm incompatible. Training requires `SGLANG_ENABLE_JIT_DEEPGEMM=0` to avoid kernel assertion failures.
License
This draft head is released under the MIT License, matching the GLM-4.7-FP8 license.
Citation
```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```