1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7
Speculative Decoding in 60 Seconds
If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.
LLM decoding at small batch sizes is memory-bandwidth bound, not compute-bound. The GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass, at roughly the cost of generating one token normally.
The output distribution is mathematically identical to what the target model would produce without speculation (at temperature 0, that means the same tokens). This is a guarantee from the accept/reject algorithm, not an approximation.
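The accept/reject rule behind that guarantee can be sketched for a simple chain of draft tokens. This is a minimal illustration of standard speculative sampling, not EAGLE3's tree verification or SGLang's implementation; the names `verify_draft` and `_sample` and the dictionary-based distributions are hypothetical.

```python
import random


def _sample(weights, rng):
    """Draw one token from a dict of non-negative (possibly unnormalized) weights."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Accept or reject a chain of draft tokens.

    draft_tokens: tokens proposed by the draft model, in order.
    draft_probs[i], target_probs[i]: dicts mapping token -> probability at
        position i under the draft model (q) and the target model (p).
    target_probs needs one extra entry: the target distribution that follows
        the last draft token.
    Returns the tokens emitted in this step.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i][tok]
        # Accept the draft token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        emitted.append(_sample(residual, rng))
        return emitted
    # Every draft token was accepted: take one bonus token from the target.
    emitted.append(_sample(target_probs[len(draft_tokens)], rng))
    return emitted
```

At temperature 0 both distributions are one-hot, so the rule reduces to "keep draft tokens until the first one that differs from the target's choice, then emit the target's token instead".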
EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. The draft head is tiny (~1.2 GB for GLM-4.7-FP8) and co-deploys on the same GPUs.
For the full algorithm walkthrough, accept/reject rule, and math behind the speedup curve, see our first post on EAGLE3 for GLM-4.7-Flash.
Results
We are releasing thoughtworks/GLM-4.7-FP8-Eagle3 — an EAGLE3 draft head for the GLM-4.7-FP8 Mixture-of-Experts model.
B=1: Up to 2.07x Throughput
Single-user (B=1), temperature 0, TP=8, server-side Prometheus metrics. Tree config: steps=3, topk=4, draft_tokens=6.
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| Terminal-Bench | 55.0 | 113.6 | 2.07x |
| MT-Bench | 66.5 | 106.7 | 1.60x |
| SWEBench-Verified | 66.1 | 104.0 | 1.57x |
| HumanEval | 66.8 | 102.2 | 1.53x |
Mean: 1.69x across all datasets. The draft head costs ~1.2 GB on top of the ~218B target — less than 1% of model memory.
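For readers who want to check the arithmetic, the short sketch below recomputes the per-dataset speedups and their mean from the throughput numbers in the B=1 table. The variable and function names are illustrative only.

```python
# (baseline tok/s, EAGLE3 tok/s), copied from the B=1 table
b1_results = {
    "Terminal-Bench": (55.0, 113.6),
    "MT-Bench": (66.5, 106.7),
    "SWEBench-Verified": (66.1, 104.0),
    "HumanEval": (66.8, 102.2),
}


def speedups(results):
    """Per-dataset speedup: EAGLE3 throughput divided by baseline throughput."""
    return {name: eagle / base for name, (base, eagle) in results.items()}


def mean_speedup(results):
    """Arithmetic mean of the per-dataset speedups."""
    values = list(speedups(results).values())
    return sum(values) / len(values)


if __name__ == "__main__":
    for name, s in speedups(b1_results).items():
        print(f"{name}: {s:.2f}x")
    print(f"mean: {mean_speedup(b1_results):.2f}x")
```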
Hardware: 8x NVIDIA H200 141GB, TP=8. Draft head co-deployed on the same GPUs.
B=32: Consistent Gains, No Regressions
At batch size 32 with 32 concurrent clients, GLM-4.7-FP8 maintains positive speedups on every dataset, something only GLM-4.7-Flash has otherwise achieved in our portfolio. Tree config: steps=3, topk=4, draft_tokens=6 (same as B=1).
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 922.7 | 1,108.4 | 1.20x |
| MT-Bench | 954.2 | 1,109.7 | 1.16x |
| Terminal-Bench | 952.3 | 1,104.3 | 1.16x |
| HumanEval | 915.1 | 1,035.9 | 1.13x |
Mean: 1.16x with no dataset below baseline. This is notable for a MoE model — MiniMax-M2.5 (a comparable-scale MoE) regresses to 0.96x mean at B=32 with the same wide tree config.
Why GLM-4.7 Holds Up at B=32
Most MoE models lose their speedup at larger batch sizes because the speculative tree forces the target model to evaluate more tokens per step, and each token can activate a different set of experts. More tokens means more expert dispatches, which saturates memory bandwidth.
GLM-4.7's sigmoid top-8 routing appears to be more stable under speculation than the top-2 routing used by models like MiniMax-M2.5 and Qwen3-Coder-Next. With 8 active experts per token, the marginal cost of each speculated token is spread across a wider compute base. The result: tree verification overhead stays manageable even at batch 32.
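To get a feel for how verification widens expert traffic, here is a back-of-the-envelope sketch. It assumes each token picks its 8 experts uniformly at random and independently of other tokens, and that each sequence contributes `draft_tokens` positions to the verification pass. Both are simplifying assumptions, not measurements of GLM-4.7's router, and the function names are hypothetical.

```python
def tokens_per_verify_pass(batch_size, draft_tokens=6):
    """Token positions the target model evaluates in one verification pass."""
    return batch_size * draft_tokens


def expected_unique_experts(num_tokens, num_experts=160, top_k=8):
    """Expected number of distinct experts touched in one MoE layer.

    Under uniform, independent routing a given expert is missed by one token
    with probability (1 - top_k / num_experts), and by all tokens with that
    probability raised to num_tokens.
    """
    p_missed_by_all = (1.0 - top_k / num_experts) ** num_tokens
    return num_experts * (1.0 - p_missed_by_all)


if __name__ == "__main__":
    for batch in (1, 32):
        plain = expected_unique_experts(batch)
        spec = expected_unique_experts(tokens_per_verify_pass(batch))
        print(f"B={batch}: {plain:.1f} experts without speculation, {spec:.1f} with")
```

Under these assumptions a B=1 step goes from 8 experts per layer to about 42 when verifying 6 tokens, while at B=32 the baseline already touches about 129 of the 160 experts and verification touches essentially all of them. Real routing is correlated across tokens, so treat these as rough orders of magnitude.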
We also tested narrow tree (topk=1, steps=5, tokens=6) at B=32. The mean was 1.14x — marginally worse than wide tree (1.16x). Unlike other MoE models where narrow tree helps, GLM-4.7-FP8 slightly prefers the wider tree at all batch sizes. Use wide tree for all workloads.
Configuration
| Parameter | Value |
|---|---|
| Target model | zai-org/GLM-4.7-FP8 (~218B MoE, ~40B active, sigmoid top-8) |
| Architecture | MoE: 160 experts, 8 active per token, sigmoid routing, 92 layers |
| Draft head | 1 layer, hidden_size=5120, aux layers [2, 46, 89] |
| Hardware | 8x H200 141GB, TP=8 |
| Training data | 54K mixed + regenerated fine-tuning (target responses at temp=0.8) |
| Training | 6 epochs original data (LR=1e-4) + 3 epochs regenerated (LR=5e-5) |
| SGLang version | v0.5.6 (tails-mpt/sglang) |
Where It Fits: Eagle3 Across Four Models
GLM-4.7-FP8 is the fourth Eagle3 draft head we have released. Here is the full portfolio, all benchmarked under identical conditions (temp=0, H200 GPUs):
B=1 Comparison
| Model | Params | Hardware | Mean Speedup |
|---|---|---|---|
| GLM-4.7-Flash | 31B MoE (3B active) | 1x H200 | 1.66x |
| GLM-4.7-FP8 | 218B MoE (40B active) | 8x H200 | 1.69x |
| MiniMax-M2.5 | 229B MoE (10B active) | 4x H200 | 1.39x |
| Gemma-4-31B | 31B dense (hybrid SWA) | 2x H200 | 1.30x |
B=32 Comparison
| Model | Mean Speedup | Any Regressions? |
|---|---|---|
| GLM-4.7-Flash | 1.16x | No |
| GLM-4.7-FP8 | 1.16x | No |
| MiniMax-M2.5 | 0.96x | Yes (SWEBench: 0.83x) |
| Gemma-4-31B | — | Incomplete (kernel crash) |
GLM-4.7-FP8 ties GLM-4.7-Flash for the best B=32 performance in the portfolio, and they are the only two models with zero regressions across all datasets. For single-user latency (B=1), GLM-4.7-FP8's 1.69x is the best mean speedup we have measured.
Engineering Notes
TP=8 Is Required
GLM-4.7-FP8 cannot run at TP=4. The model's shared expert has an intermediate dimension of 512, and 512/8 = 64 — not divisible by the FP8 block_n=128 constraint. At TP=8, the dimension is handled by the tensor parallelism split before the FP8 kernel boundary.
Regenerated Training Data Matters
The draft head was first trained for 6 epochs on generic mixed data (ShareGPT, UltraChat, PerfectBlend). This gave training accuracy of 0.90 (Exp C). We then fine-tuned for 3 additional epochs on data where the assistant responses were generated by GLM-4.7 itself at temp=0.8. This pushed accuracy to 0.97 (Exp E) and produced measurably better speedups.
The intuition: generic training data teaches the draft head to predict "reasonable" next tokens, but the target model has its own stylistic preferences. Regenerated data aligns the draft to the target's actual output distribution.
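The regeneration step can be sketched as follows. This is a hedged illustration of the idea, not our actual data pipeline: `regenerate_responses` and `chat_fn` are hypothetical names, and the `max_tokens` default is an arbitrary assumption. `chat_fn` stands in for any client that sends a message list to the target model (for example through an OpenAI-compatible chat endpoint) and returns the reply text.

```python
def regenerate_responses(conversation, chat_fn, temperature=0.8, max_tokens=2048):
    """Replace every assistant turn with the target model's own response.

    conversation: list of {"role": ..., "content": ...} dicts.
    chat_fn: callable(messages, temperature, max_tokens) -> str that queries
        the target model (hypothetical; supply your own client).
    Returns a new conversation; the input is left unmodified.
    """
    regenerated = []
    for turn in conversation:
        if turn["role"] == "assistant":
            # Condition on the history so far, including earlier regenerated
            # replies, so later turns stay consistent with the new text.
            reply = chat_fn(list(regenerated), temperature, max_tokens)
            regenerated.append({"role": "assistant", "content": reply})
        else:
            regenerated.append(dict(turn))
    return regenerated
```

User and system turns are kept verbatim, so the prompts stay as diverse as the original mix while the targets the draft head learns from come from GLM-4.7 itself.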
Training Required SGLANG_ENABLE_JIT_DEEPGEMM=0
GLM-4.7-FP8 triggers deep_gemm JIT kernel compilation during EAGLE3 training, which fails with a kernel_runtime.hpp:45 assertion on some CUDA environments. Setting SGLANG_ENABLE_JIT_DEEPGEMM=0 falls back to standard Triton MoE kernels. This is a training-time issue only — inference runs fine with default settings.
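One way to apply the workaround without changing your shell configuration is to set the variable only for the training process. A minimal sketch, assuming you launch training from Python; `training_env` is a hypothetical helper and the training command itself is not shown here.

```python
import os


def training_env(base_env=None):
    """Return a copy of the environment with deep_gemm JIT compilation disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
    return env


# Usage, where training_cmd is your own EAGLE3 training command:
#   import subprocess
#   subprocess.run(training_cmd, env=training_env(), check=True)
```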
Caveats
- Temperature 0 only for production. At temp>0, MoE expert routing becomes non-deterministic. The draft head cannot predict which experts the target will activate, so acceptance rates drop and B=32 regresses. Deploy at temp=0 for coding, factual, and tool-use workloads.
- 8x H200 is the minimum. The FP8-quantized model fits on 8x H200 at TP=8. Smaller GPU counts or lower-memory GPUs (A100 80GB) are not sufficient.
- SGLang fork required. Our fork includes patches for GLM-4.7 Eagle3 support that have not yet been upstreamed.
How to Use
```bash
# Install our SGLang fork
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

# Launch server with Eagle3
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-FP8 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 4 \
    --tp 8 \
    --trust-remote-code \
    --port 30000
```
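```python
import requests

# Query the OpenAI-compatible chat endpoint exposed by the server above
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Explain the difference between TCP and UDP."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
response.raise_for_status()  # surface HTTP errors instead of failing on a missing key
print(response.json()["choices"][0]["message"]["content"])
```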
The draft head checkpoint is ~1.2 GB and co-deploys on the same GPUs as the target model. No additional hardware required.
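To sanity-check the speedup on your own prompts, you can time requests from the client. The sketch below is a rough approach: it assumes the response carries an OpenAI-style `usage.completion_tokens` field, and the measured time includes prefill and network latency, so the result will read lower than the server-side Prometheus numbers reported above. `measure_decode_throughput` and `send_request` are hypothetical names.

```python
import time


def measure_decode_throughput(send_request, payload, clock=time.perf_counter):
    """Time one request and return (completion_tokens, seconds, tokens_per_second).

    send_request: callable(payload) -> parsed JSON dict of the chat response.
    clock: injectable timer, mainly so the function can be tested offline.
    """
    start = clock()
    body = send_request(payload)
    elapsed = clock() - start
    tokens = body["usage"]["completion_tokens"]
    return tokens, elapsed, tokens / elapsed


# Usage with the server above (requires `import requests`):
#   send = lambda p: requests.post(
#       "http://localhost:30000/v1/chat/completions", json=p, timeout=600
#   ).json()
#   print(measure_decode_throughput(send, {"model": "default", "messages": [...],
#                                          "max_tokens": 512, "temperature": 0}))
```

Run the same prompts against a server launched without the `--speculative-*` flags to get a comparable baseline.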
What's Next
We are continuing to expand the EAGLE3 portfolio to more model families and architectures. Each model teaches us something new about how speculative decoding interacts with model design — routing strategies, attention variants, quantization schemes. The draft heads, training scripts, and benchmark tooling are all open source.
- Draft head: thoughtworks/GLM-4.7-FP8-Eagle3
- Training framework: SpecForge
- Serving engine: SGLang fork
- EAGLE3 paper: arXiv:2503.01840
Citation
```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```




