EAGLE3 Draft Head — GLM-4.7-FP8

A lightweight EAGLE3 draft head for GLM-4.7-FP8 (~218B-parameter MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with SpecForge on 8x H200 GPUs using the EAGLE-3 Training-Time Test (TTT) objective.

GLM-4.7 uses sigmoid top-8 routing — activating 8 out of 160 experts per token rather than the typical 1-2 in most MoE models. This preserves high representational capacity at the cost of increased compute, making speculative decoding especially valuable: the draft head is tiny relative to the 218B target.
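The routing described above can be sketched as follows. This is a minimal illustration of sigmoid top-k expert selection with renormalized mixing weights, not GLM's actual implementation (the toy logits, the absence of a routing bias term, and the normalization step are assumptions):

```python
import math

NUM_EXPERTS = 160
TOP_K = 8

def sigmoid_topk_route(logits, k=TOP_K):
    """Score each expert with an independent sigmoid (not a softmax),
    pick the k highest scores, and renormalize them to sum to 1."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}  # expert index -> mixing weight

# Toy router logits for 160 experts: expert i gets logit (i % 10) / 10
logits = [(i % 10) / 10 for i in range(NUM_EXPERTS)]
weights = sigmoid_topk_route(logits)
assert len(weights) == TOP_K
assert abs(sum(weights.values()) - 1.0) < 1e-9
```

Because each expert is scored by an independent sigmoid rather than one shared softmax, expert scores do not compete directly; only the selected top-8 are renormalized into mixing weights.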

Blog post: 1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7

Usage

SGLang (GPU)

Requires our SGLang fork for GLM-4.7 Eagle3 support.

B=1 server (wide tree — optimal for single-user, real-time requests):

pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-FP8 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 4 \
    --tp 8 \
    --trust-remote-code \
    --port 30000

B=32 server: wide tree is also recommended at B=32 for this model, so launch with the same command as above — the optimal tree configuration (--speculative-num-steps 3, --speculative-eagle-topk 4, --speculative-num-draft-tokens 6) carries over unchanged from B=1.

Note: Unlike other MoE models where narrow tree helps at B=32, GLM-4.7-FP8 performs marginally better with wide tree (1.16x vs 1.14x). Use wide tree for all workloads.
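To make the three tree flags concrete, here is a toy sketch of how a wide draft tree could be grown: at each of `num_steps` levels every frontier node expands its `topk` most likely continuations, and only the `num_draft_tokens` highest-probability candidates are kept for verification. This illustrates the flag semantics only; it is not SGLang's actual tree-construction code, and `toy_probs` is an invented stand-in for the draft head's distribution.

```python
def grow_tree(next_token_probs, num_steps=3, topk=4, num_draft_tokens=6):
    """next_token_probs(prefix) -> list of (token, prob).
    Returns up to num_draft_tokens candidate paths, best-first."""
    frontier = [((), 1.0)]  # (token path, cumulative probability)
    candidates = []
    for _ in range(num_steps):
        new_frontier = []
        for path, p in frontier:
            best = sorted(next_token_probs(path), key=lambda t: -t[1])[:topk]
            for tok, q in best:
                new_frontier.append((path + (tok,), p * q))
        # Keep only the best paths to bound the tree's width
        new_frontier.sort(key=lambda x: -x[1])
        frontier = new_frontier[:num_draft_tokens]
        candidates.extend(frontier)
    candidates.sort(key=lambda x: -x[1])
    return candidates[:num_draft_tokens]

# Toy draft distribution: token t in {0..3} with probability (4 - t) / 10
def toy_probs(prefix):
    return [(t, (4 - t) / 10) for t in range(4)]

paths = grow_tree(toy_probs)
assert len(paths) <= 6
assert all(paths[i][1] >= paths[i + 1][1] for i in range(len(paths) - 1))
```

Intuitively, a wide tree (higher topk per step) hedges across more alternative continuations per verification pass, which is what this model rewards even at B=32.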

Python Client

import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    }
)
print(response.json()["choices"][0]["message"]["content"])

Training Details

| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=8, DP=1) |
| Pre-training | 6 epochs on 54K mixed samples (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 1024 |
| TTT length (Training-Time Test unroll) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |

Training Method

EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 2, 46, 89 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
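The accept/reject dynamic that the TTT objective optimizes for can be illustrated with a toy greedy-verification loop. This sketch mimics greedy speculative verification at inference time, not the TTT loss itself:

```python
def accepted_length(draft_tokens, target_tokens):
    """Greedy verification: the target model accepts the draft's tokens
    up to the first mismatch. Returns the number of accepted tokens."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

# If the draft matches the target on the first 2 of 3 proposed tokens,
# those 2 tokens are committed with a single target forward pass.
assert accepted_length([5, 7, 9], [5, 7, 2]) == 2
assert accepted_length([1, 2, 3], [1, 2, 3]) == 3
```

The more of this matched-prefix length the draft head achieves on average (the "Accept Length" column in the benchmarks), the fewer expensive target forward passes are needed per generated token.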

Regenerated Data

The final fine-tuning stage uses training data where the assistant responses were generated by GLM-4.7 itself (at temp=0.8), rather than using generic ShareGPT/UltraChat responses. This aligns the draft model's predicted distribution with the target model's actual output, improving acceptance rates — especially at high batch sizes (B=32) where every accepted token matters more.
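A sketch of the regeneration step under stated assumptions: prompts from the mixed dataset are re-answered by the target model at temperature 0.8 via the OpenAI-compatible endpoint shown in the client example, and stored as conversation records. The helper names and the JSONL record layout here are illustrative, not the exact SpecForge data format:

```python
import json

def build_regen_request(prompt, max_tokens=1024):
    """Request body for regenerating an assistant turn with the
    target model at temp=0.8 (the fine-tuning data recipe)."""
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.8,
    }

def to_training_record(prompt, response_text):
    """One JSONL line pairing the original prompt with the
    target model's own response."""
    return json.dumps({
        "conversations": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response_text},
        ]
    })

# Example payload (the actual generation would POST this body
# to /v1/chat/completions on a running target-model server):
req = build_regen_request("Explain speculative decoding.")
assert req["temperature"] == 0.8
```

Training on the target's own samples means the draft head learns the distribution it must imitate at serving time, rather than whichever models produced the original ShareGPT/UltraChat answers.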

Performance

B=1 Inference Benchmarks (temp=0, FP8, TP=8)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| Terminal-Bench | 55.0 | 113.6 | 2.07x | 42.5% | 2.55 |
| MT-Bench | 66.5 | 106.7 | 1.60x | 42.5% | 2.55 |
| SWEBench-Verified | 66.1 | 104.0 | 1.57x | 45.0% | 2.70 |
| HumanEval | 66.8 | 102.2 | 1.53x | 54.2% | 3.25 |
| Mean | 63.6 | 106.6 | 1.69x | 46.1% | 2.76 |

B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 922.7 | 1,108.4 | 1.20x |
| MT-Bench | 954.2 | 1,109.7 | 1.16x |
| Terminal-Bench | 952.3 | 1,104.3 | 1.16x |
| HumanEval | 915.1 | 1,035.9 | 1.13x |
| Mean | 936.1 | 1,089.6 | 1.16x |

Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit 63291f7f51.

Model Architecture

| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 5120 |
| Num hidden layers | 1 |
| Num attention heads | 40 (8 KV heads) |
| head_dim | 128 |
| Intermediate size | 16384 |
| Auxiliary layers | [2, 46, 89] |
| Vocab size | 151552 (target) / 32000 (draft) |
| Checkpoint size | ~1.2 GB |
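A back-of-envelope check of the table against the checkpoint size. This assumes a standard Llama decoder layer with GQA, a 3-to-1 fusion projection over the three auxiliary hidden states, and tied input/output embeddings on the 32K draft vocab — structural assumptions of ours, not values read from the config:

```python
hidden, inter = 5120, 16384
heads, kv_heads, head_dim = 40, 8, 128
draft_vocab = 32000

attn = (hidden * heads * head_dim            # q_proj
        + 2 * hidden * kv_heads * head_dim   # k_proj, v_proj
        + heads * head_dim * hidden)         # o_proj
mlp = 3 * hidden * inter                     # gate, up, down (SwiGLU)
fuse = 3 * hidden * hidden                   # concat of 3 aux layers -> hidden
embed = draft_vocab * hidden                 # assumed tied with lm_head

total = attn + mlp + fuse + embed
# Lands in the ~0.6B-parameter range, consistent with a ~1.2 GB bf16 checkpoint
assert 0.5e9 < total < 0.7e9
```

Under these assumptions the count is ~0.56B parameters, i.e. roughly 1.1 GB at 2 bytes per bf16 weight — in line with the ~1.2 GB checkpoint above.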

Limitations

  • TP=8 required. FP8 block constraint: the shared expert has intermediate_size=512, and the per-rank shard 512/8=64 is not divisible by block_n=128; TP=4 fails at this sharding boundary.
  • Temperature sensitivity. Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. Deploy at temp=0 for coding and factual workloads.
  • FP8 quantization. The target model runs in FP8. The draft head itself is bfloat16 but depends on the target's FP8 hidden states during inference.
  • Requires SGLang fork. Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
  • JIT deep_gemm incompatible. Training requires SGLANG_ENABLE_JIT_DEEPGEMM=0 to avoid kernel assertion failures.

License

This draft head is released under the MIT License, matching the GLM-4.7-FP8 license.

Citation

@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}