1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7

Community Article Published April 15, 2026

At Thoughtworks, we build inference optimization tools for production LLM deployments. GLM-4.7 is a 218-billion-parameter Mixture-of-Experts model with an unusual design: 160 experts with sigmoid top-8 routing — far more experts than most MoE architectures. We trained an EAGLE3 draft head for it and measured 1.69x mean single-user throughput and 2.07x on our Terminal-Bench coding dataset — all while maintaining 1.16x at batch 32, with no regressions on any dataset.


Speculative Decoding in 60 Seconds

If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.

At small batch sizes, LLM decoding is memory-bandwidth bound, not compute-bound. The GPU spends most of each step loading model weights from memory rather than doing math. Speculative decoding exploits this idle compute: a small draft model cheaply proposes several tokens, then the full target model verifies them all in a single forward pass, at roughly the cost of generating one token normally.

The output is mathematically identical in distribution to what the target model would produce without speculation; at temperature 0, that means the same tokens. This is a guarantee of the accept/reject algorithm, not an approximation.
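The accept/reject rule can be sketched for the simplest case, a single linear chain of draft tokens. This is the standard speculative-sampling rule, not EAGLE3's tree verification as SGLang implements it; the function names and the dict-based distributions are illustrative assumptions.

```python
import random

def sample(weights, rng):
    """Draw a token from an unnormalized {token: weight} mapping."""
    total = sum(weights.values())
    r = rng.random() * total
    acc = 0.0
    last = None
    for tok, w in weights.items():
        if w <= 0.0:
            continue
        acc += w
        last = tok
        if r < acc:
            return tok
    return last  # guards against floating-point rounding at the upper edge

def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """Verify a linear chain of draft tokens against the target model.

    q_dists[i] is the draft distribution and p_dists[i] the target
    distribution at position i, both as {token: probability}.
    p_dists carries one extra entry for the bonus position.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = {t: max(0.0, pt - q.get(t, 0.0)) for t, pt in p.items()}
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one bonus token from the target.
    out.append(sample(p_dists[-1], rng))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    target = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]  # one-hot, as at temperature 0
    draft = [{"a": 0.9, "x": 0.1}, {"x": 1.0}]
    print(verify_draft(["a", "x"], draft, target, rng))
```

With one-hot target distributions, as at temperature 0, a draft token is kept only when it equals the target's own choice, and the first mismatch is replaced by the target's token.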

EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. The draft head is tiny (~1.2 GB for GLM-4.7-FP8) and co-deploys on the same GPUs.

For the full algorithm walkthrough, accept/reject rule, and math behind the speedup curve, see our first post on EAGLE3 for GLM-4.7-Flash.


Results

We are releasing thoughtworks/GLM-4.7-FP8-Eagle3 — an EAGLE3 draft head for the GLM-4.7-FP8 Mixture-of-Experts model.

B=1: Up to 2.07x Throughput

Single-user (B=1), temperature 0, TP=8, server-side Prometheus metrics. Tree config: steps=3, topk=4, draft_tokens=6.

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| Terminal-Bench | 55.0 | 113.6 | 2.07x |
| MT-Bench | 66.5 | 106.7 | 1.60x |
| SWEBench-Verified | 66.1 | 104.0 | 1.57x |
| HumanEval | 66.8 | 102.2 | 1.53x |

Mean: 1.69x across all datasets. The draft head costs ~1.2 GB on top of the ~218B target — less than 1% of model memory.
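The arithmetic behind these figures can be reproduced from the B=1 table. This is a small sketch: the throughput numbers are copied from the table, and the memory estimate assumes FP8 weights take about 1 byte per parameter, which this post does not state.

```python
# Throughput in tok/s, copied from the B=1 table: (baseline, EAGLE3).
B1_RESULTS = {
    "Terminal-Bench": (55.0, 113.6),
    "MT-Bench": (66.5, 106.7),
    "SWEBench-Verified": (66.1, 104.0),
    "HumanEval": (66.8, 102.2),
}

def speedups(results):
    """Per-dataset speedup: EAGLE3 throughput divided by baseline throughput."""
    return {name: eagle / base for name, (base, eagle) in results.items()}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

per_dataset = speedups(B1_RESULTS)
mean_speedup = mean(per_dataset.values())

# Assumption: ~218B parameters at ~1 byte each is roughly 218 GB of weights.
draft_fraction = 1.2 / 218.0

for name, value in per_dataset.items():
    print(f"{name}: {value:.2f}x")
print(f"mean: {mean_speedup:.2f}x, draft head share of weights: {draft_fraction:.2%}")
```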

[Figure: 01-speedup-bars]

Hardware: 8x NVIDIA H200 144GB, TP=8. Draft head co-deployed on the same GPUs.

B=32: Consistent Gains, No Regressions

At batch 32 with 32 concurrent clients, GLM-4.7-FP8 keeps a positive speedup on every dataset. GLM-4.7-Flash is the only other model in our portfolio to do so. Tree config: steps=3, topk=4, draft_tokens=6 (same as B=1).

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 922.7 | 1,108.4 | 1.20x |
| MT-Bench | 954.2 | 1,109.7 | 1.16x |
| Terminal-Bench | 952.3 | 1,104.3 | 1.16x |
| HumanEval | 915.1 | 1,035.9 | 1.13x |

Mean: 1.16x with no dataset below baseline. This is notable for a MoE model — MiniMax-M2.5 (a comparable-scale MoE) regresses to 0.96x mean at B=32 with the same wide tree config.

[Figure: 02-batch-results]

Why GLM-4.7 Holds Up at B=32

Most MoE models suffer at batch because the speculative tree forces the model to evaluate more tokens per step, and each token activates a separate set of experts. More tokens means more expert dispatches, which saturates memory bandwidth.
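To get a feel for how quickly verification touches more experts, here is a back-of-the-envelope sketch. It assumes each token picks its experts uniformly at random and independently, within a single MoE layer. Real routers are neither uniform nor independent, so treat the numbers as a rough upper-level intuition rather than a measurement.

```python
def expected_distinct_experts(num_experts, top_k, num_tokens):
    """Expected number of distinct experts hit in one MoE layer.

    Assumes every token selects top_k of num_experts uniformly at random,
    independently of the other tokens (a simplification).
    """
    p_miss = 1.0 - top_k / num_experts  # chance one token skips a given expert
    return num_experts * (1.0 - p_miss ** num_tokens)

if __name__ == "__main__":
    # 160 experts and 8 active per token, as in GLM-4.7's configuration.
    # 6 tokens matches draft_tokens=6 at B=1; 192 = 32 requests x 6 tokens.
    for tokens in (1, 6, 32, 192):
        hit = expected_distinct_experts(160, 8, tokens)
        print(f"{tokens:>3} tokens per step -> ~{hit:.1f} distinct experts")
```

Under this simplification, the number of distinct experts whose weights must be read grows quickly with the number of verified tokens and then saturates at the full expert count.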

GLM-4.7's sigmoid top-8 routing appears to be more stable under speculation than the top-2 routing used by models like MiniMax-M2.5 and Qwen3-Coder-Next. With 8 active experts per token, the marginal cost of each speculated token is spread across a wider compute base. The result: tree verification overhead stays manageable even at batch 32.
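For readers unfamiliar with sigmoid top-k routing, a minimal sketch follows. It scores each expert independently with a sigmoid, keeps the 8 highest, and renormalizes their weights. The renormalization step and the function name are assumptions for illustration; GLM-4.7's actual router may add bias terms, scaling, or grouped selection that this post does not describe.

```python
import math
import random

def sigmoid_topk_route(logits, k=8):
    """Return [(expert_index, weight)] for the k highest sigmoid scores."""
    # tanh form of the sigmoid, numerically stable for large |x|.
    scores = [0.5 * (1.0 + math.tanh(x / 2.0)) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Assumption: the selected scores are renormalized to sum to 1.
    return [(i, scores[i] / total) for i in top]

if __name__ == "__main__":
    rng = random.Random(0)
    router_logits = [rng.gauss(0.0, 1.0) for _ in range(160)]  # one token, 160 experts
    for expert, weight in sigmoid_topk_route(router_logits):
        print(f"expert {expert:>3}: weight {weight:.3f}")
```

Unlike softmax routing, each sigmoid score depends only on its own logit, so the experts do not compete for probability mass before the top-k selection.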

We also tested a narrow tree (topk=1, steps=5, draft_tokens=6) at B=32. Its mean speedup was 1.14x, marginally below the wide tree's 1.16x. Unlike other MoE models where a narrow tree helps, GLM-4.7-FP8 slightly prefers the wider tree at all batch sizes. Use the wide tree for all workloads.

Configuration

| Parameter | Value |
|---|---|
| Target model | zai-org/GLM-4.7-FP8 (~218B MoE, ~40B active, sigmoid top-8) |
| Architecture | MoE: 160 experts, 8 active per token, sigmoid routing, 92 layers |
| Draft head | 1 layer, hidden_size=5120, aux layers [2, 46, 89] |
| Hardware | 8x H200 144GB, TP=8 |
| Training data | 54K mixed + regenerated fine-tuning (target responses at temp=0.8) |
| Training | 6 epochs original data (LR=1e-4) + 3 epochs regenerated (LR=5e-5) |
| SGLang version | v0.5.6 (tails-mpt/sglang) |

Where It Fits: Eagle3 Across Four Models

GLM-4.7-FP8 is the fourth Eagle3 draft head we have released. Here is the full portfolio, all benchmarked under identical conditions (temp=0, H200 GPUs):

B=1 Comparison

| Model | Params | Hardware | Mean Speedup |
|---|---|---|---|
| GLM-4.7-Flash | 31B MoE (3B active) | 1x H200 | 1.66x |
| GLM-4.7-FP8 | 218B MoE (40B active) | 8x H200 | 1.69x |
| MiniMax-M2.5 | 229B MoE (10B active) | 4x H200 | 1.39x |
| Gemma-4-31B | 31B dense (hybrid SWA) | 2x H200 | 1.30x |

B=32 Comparison

| Model | Mean Speedup | Any Regressions? |
|---|---|---|
| GLM-4.7-Flash | 1.16x | No |
| GLM-4.7-FP8 | 1.16x | No |
| MiniMax-M2.5 | 0.96x | Yes (SWEBench: 0.83x) |
| Gemma-4-31B | Incomplete (kernel crash) | |

GLM-4.7-FP8 ties GLM-4.7-Flash for the best B=32 mean in the portfolio, and they are the only two models with no regression on any dataset. For single-user throughput (B=1), GLM-4.7-FP8's 1.69x is the best mean speedup we have measured.

[Figure: 03-portfolio]


Engineering Notes

TP=8 Is Required

GLM-4.7-FP8 cannot run at TP=4. The model's shared expert has an intermediate dimension of 512, and 512/8 = 64 — not divisible by the FP8 block_n=128 constraint. At TP=8, the dimension is handled by the tensor parallelism split before the FP8 kernel boundary.

Regenerated Training Data Matters

The draft head was first trained for 6 epochs on generic mixed data (ShareGPT, UltraChat, PerfectBlend). This gave training accuracy of 0.90 (Exp C). We then fine-tuned for 3 additional epochs on data where the assistant responses were generated by GLM-4.7 itself at temp=0.8. This pushed accuracy to 0.97 (Exp E) and produced measurably better speedups.

The intuition: generic training data teaches the draft head to predict "reasonable" next tokens, but the target model has its own stylistic preferences. Regenerated data aligns the draft to the target's actual output distribution.
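The regeneration step can be sketched as follows. This is not our training pipeline, only an illustration of the idea: each assistant turn in a conversation is replaced by the target model's own response, sampled at temp=0.8. The helper names are hypothetical, and the endpoint and payload simply mirror the request shown in How to Use.

```python
import json
import urllib.request

def http_generate(messages, url="http://localhost:30000/v1/chat/completions",
                  temperature=0.8, max_tokens=512):
    """Ask the served target model for the next assistant turn."""
    payload = json.dumps({
        "model": "default",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

def regenerate(conversation, generate=http_generate):
    """Return a copy of the conversation with every assistant turn regenerated.

    Each new assistant turn is conditioned on the turns rebuilt so far,
    so later turns follow the target model's own earlier answers.
    """
    rebuilt = []
    for turn in conversation:
        if turn["role"] == "assistant":
            rebuilt.append({"role": "assistant", "content": generate(list(rebuilt))})
        else:
            rebuilt.append(dict(turn))
    return rebuilt
```

Passing a different `generate` callable makes the function easy to dry-run without a server, for example `regenerate(conv, generate=lambda msgs: "stub")`.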

Training Required SGLANG_ENABLE_JIT_DEEPGEMM=0

GLM-4.7-FP8 triggers deep_gemm JIT kernel compilation during EAGLE3 training, which fails with a kernel_runtime.hpp:45 assertion on some CUDA environments. Setting SGLANG_ENABLE_JIT_DEEPGEMM=0 falls back to standard Triton MoE kernels. This is a training-time issue only — inference runs fine with default settings.
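One way to apply the workaround is to set the variable only in the environment of the training process, so that inference launched from the same shell keeps its defaults. The sketch below assumes the variable is read when SGLang starts; the script name in the comment is hypothetical.

```python
import os

def training_env(base_env=None):
    """Copy an environment and disable JIT DeepGEMM for the training run."""
    env = dict(os.environ if base_env is None else base_env)
    env["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
    return env

# Example use (train_eagle3.py is a placeholder, not a script from this post):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train_eagle3.py"], env=training_env(), check=True)
```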


Caveats

  • Temperature 0 only for production. At temp>0, MoE expert routing becomes non-deterministic. The draft head cannot predict which experts the target will activate, so acceptance rates drop and B=32 regresses. Deploy at temp=0 for coding, factual, and tool-use workloads.
  • 8x H200 is the minimum. The FP8-quantized model fits on 8x H200 at TP=8. Smaller GPU counts or lower-memory GPUs (A100 80GB) are not sufficient.
  • SGLang fork required. Our fork includes patches for GLM-4.7 Eagle3 support that have not yet been upstreamed.

How to Use

# Install our SGLang fork
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

# Launch server with Eagle3
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-FP8 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 4 \
    --tp 8 \
    --trust-remote-code \
    --port 30000
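
# Query the OpenAI-compatible endpoint from Python (run in a separate session)
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Explain the difference between TCP and UDP."}],
        "max_tokens": 512,
        "temperature": 0,  # see Caveats: temp=0 is the recommended setting
    },
    timeout=300,  # fail instead of hanging if the server is unreachable
)
response.raise_for_status()  # surface HTTP errors before parsing the body
print(response.json()["choices"][0]["message"]["content"])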

The draft head checkpoint is ~1.2 GB and co-deploys on the same GPUs as the target model. No additional hardware required.
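For a quick sanity check of throughput from the client side, a sketch like the one below can help. It assumes the response carries the OpenAI-style `usage.completion_tokens` field, and it times the whole request, so prefill and network overhead are included. Expect it to read lower than the server-side Prometheus metrics used for the tables above. The function names are illustrative.

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds

def timed_completion(prompt, url="http://localhost:30000/v1/chat/completions",
                     max_tokens=512):
    """Send one temp=0 request and return client-side tok/s."""
    payload = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # Assumption: the server reports token counts in the "usage" object.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Running `timed_completion` against a server launched with and without the speculative flags gives a rough single-user comparison on your own prompts.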


What's Next

We are continuing to expand the EAGLE3 portfolio to more model families and architectures. Each model teaches us something new about how speculative decoding interacts with model design — routing strategies, attention variants, quantization schemes. The draft heads, training scripts, and benchmark tooling are all open source.


Citation

@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
