Release EAGLE3 draft head for GLM-4.7-FP8 (exp-e, acc=0.97)
- README.md +167 -0
- config.json +39 -0
- model.safetensors +3 -0
README.md
ADDED
@@ -0,0 +1,167 @@
---
library_name: transformers
license: apache-2.0
language:
- en
base_model: THUDM/GLM-4.7
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
- fp8
---

<!-- Internal: exp-e (gpu/glm47-fp8) -->

# EAGLE3 Draft Head — GLM-4.7-FP8

A lightweight EAGLE3 draft head for [GLM-4.7](https://huggingface.co/THUDM/GLM-4.7) (~218B-parameter MoE with 160 experts, sigmoid top-8 routing, and ~40B active parameters per token). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

GLM-4.7 uses sigmoid top-8 routing, activating 8 of its 160 experts per token rather than the 1-2 typical of most MoE models. This preserves high representational capacity at the cost of extra compute, which makes speculative decoding especially valuable: the draft head is tiny relative to the 218B target.

**Blog post**: [TODO: link after publication]

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for GLM-4.7 Eagle3 support.

**B=1 server** (wide tree; optimal for single-user, real-time requests):

```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path THUDM/GLM-4.7 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```

**B=32 server** (the wide tree is also recommended at B=32 for this model, so the launch command is identical):

```bash
python -m sglang.launch_server \
  --model-path THUDM/GLM-4.7 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```

**Note**: Unlike other MoE models, where a narrow tree helps at B=32, GLM-4.7-FP8 performs marginally better with the wide tree (1.16x vs 1.14x). Use the wide tree for all workloads.

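If you want to reproduce that comparison yourself, a narrow-tree variant only changes the speculative flags. The values below (topk=1, steps=5) are a common single-chain configuration and an illustrative assumption here, not a tuned recommendation:

```bash
# Hypothetical narrow-tree (single-chain) variant, for A/B benchmarking only
python -m sglang.launch_server \
  --model-path THUDM/GLM-4.7 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```
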
### Python Client

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

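To sanity-check the speedup on your own hardware, a minimal timing harness against the same endpoint could look like the sketch below. It assumes the standard OpenAI-compatible `usage.completion_tokens` field and includes prefill time, so it slightly understates pure decode throughput:

```python
import time

import requests

def tokens_per_second(prompt: str, max_tokens: int = 512) -> float:
    """Rough end-to-end decode throughput for a single request."""
    start = time.perf_counter()
    r = requests.post(
        "http://localhost:30000/v1/chat/completions",
        json={
            "model": "default",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0,  # greedy decoding gives the best acceptance rates
        },
    )
    elapsed = time.perf_counter() - start
    # Completion token count comes from the OpenAI-compatible usage block
    return r.json()["usage"]["completion_tokens"] / elapsed

print(f"{tokens_per_second('Write a function to merge two sorted lists.'):.1f} tok/s")
```
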
## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=8, DP=1) |
| Pre-training | 6 epochs on 54K mixed samples (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 1024 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |

### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token from hidden states captured at three auxiliary layers of the target model (layers 2, 46, and 89: early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative accept/reject process during training to maximize the expected number of accepted tokens at inference time.

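For intuition, here is a minimal PyTorch sketch of an EAGLE3-style head wired up with this repo's config values (hidden_size=5120, one layer, 40 heads, draft_vocab_size=32000). It illustrates the idea only; it is not SpecForge's actual implementation, and it omits the causal mask, KV cache, and RoPE:

```python
import torch
import torch.nn as nn

class Eagle3DraftHeadSketch(nn.Module):
    """Illustrative EAGLE3-style draft head (NOT the SpecForge implementation)."""

    def __init__(self, hidden=5120, draft_vocab=32000, n_heads=40):
        super().__init__()
        # Fuse the three auxiliary hidden states (target layers 2 / 46 / 89)
        self.fc = nn.Linear(3 * hidden, hidden, bias=False)
        # Combine fused target features with the current token embedding
        self.embed = nn.Embedding(draft_vocab, hidden)
        self.input_proj = nn.Linear(2 * hidden, hidden, bias=False)
        # A single transformer layer (num_hidden_layers=1 in config.json);
        # the real head is a llama-style decoder block with RoPE and GQA
        self.layer = nn.TransformerEncoderLayer(
            hidden, n_heads, dim_feedforward=16384, batch_first=True
        )
        self.norm = nn.RMSNorm(hidden)  # requires torch >= 2.4
        # Predict over the reduced 32K draft vocab, not the 151K target vocab
        self.lm_head = nn.Linear(hidden, draft_vocab, bias=False)

    def forward(self, input_ids, aux_states):
        # aux_states: (batch, seq, 3 * hidden), captured from the target model
        fused = self.fc(aux_states)
        x = self.input_proj(torch.cat([self.embed(input_ids), fused], dim=-1))
        x = self.layer(x)  # sketch omits the causal attention mask
        return self.lm_head(self.norm(x))
```
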
### Regenerated Data

The final fine-tuning stage uses training data whose assistant responses were generated by GLM-4.7 itself (at temp=0.8) rather than the generic ShareGPT/UltraChat responses. This aligns the draft model's predicted distribution with the target model's actual output, improving acceptance rates, especially at high batch sizes (B=32) where every accepted token matters more.

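A minimal sketch of that regeneration step, assuming the target model is already serving on the endpoint above (the file names and JSONL schema are illustrative, not SpecForge's data format):

```python
import json

import requests

# Replace each assistant turn with GLM-4.7's own response at temp=0.8,
# keeping the original user prompts.
with open("prompts.jsonl") as f, open("regenerated.jsonl", "w") as out:
    for line in f:
        prompt = json.loads(line)["prompt"]
        r = requests.post(
            "http://localhost:30000/v1/chat/completions",
            json={
                "model": "default",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 1024,
                "temperature": 0.8,  # matches the fine-tuning data recipe
            },
        )
        reply = r.json()["choices"][0]["message"]["content"]
        out.write(json.dumps({"prompt": prompt, "response": reply}) + "\n")
```
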
## Performance

### B=1 Inference Benchmarks (temp=0, FP8, TP=8)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---------|-----------------|----------------|---------|-------------|---------------|
| Terminal-Bench | 55.0 | 113.6 | **2.07x** | 42.5% | 2.55 |
| MT-Bench | 66.5 | 106.7 | **1.60x** | 42.5% | 2.55 |
| SWEBench-Verified | 66.1 | 104.0 | **1.57x** | 45.0% | 2.70 |
| HumanEval | 66.8 | 102.2 | **1.53x** | 54.2% | 3.25 |
| **Mean** | **63.6** | **106.6** | **1.69x** | **46.1%** | **2.76** |

### B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| SWEBench-Verified | 922.7 | 1,108.4 | **1.20x** |
| MT-Bench | 954.2 | 1,109.7 | **1.16x** |
| Terminal-Bench | 952.3 | 1,104.3 | **1.16x** |
| HumanEval | 915.1 | 1,035.9 | **1.13x** |
| **Mean** | **936.1** | **1,089.6** | **1.16x** |

*Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit `63291f7f51`.*

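Note that the mean speedup is the average of the per-dataset speedups rather than the ratio of the mean throughputs; a quick check against the B=1 table:

```python
baseline = [55.0, 66.5, 66.1, 66.8]    # tok/s without speculation
eagle3 = [113.6, 106.7, 104.0, 102.2]  # tok/s with the draft head

speedups = [e / b for e, b in zip(eagle3, baseline)]
print([round(s, 2) for s in speedups])          # [2.07, 1.6, 1.57, 1.53]
print(round(sum(speedups) / len(speedups), 2))  # 1.69
```
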
## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 5120 |
| Num hidden layers | 1 |
| Num attention heads | 40 (8 KV heads) |
| head_dim | 128 |
| Intermediate size | 16384 |
| Auxiliary layers | [2, 46, 89] |
| Vocab size | 151552 (target) / 32000 (draft) |
| Checkpoint size | ~1.2 GB |

## Limitations

- **TP=8 required.** FP8 block constraint: the shared expert's intermediate_size is 512, and 512/8=64 is not divisible by block_n=128; TP=4 fails at this boundary.
- **Temperature sensitivity.** Performance is best at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. Deploy at temp=0 for coding and factual workloads.
- **FP8 quantization.** The target model runs in FP8. The draft head itself is bfloat16 but consumes the target's FP8-derived hidden states at inference time.
- **Requires SGLang fork.** Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
- **JIT deep_gemm incompatible.** Training requires `SGLANG_ENABLE_JIT_DEEPGEMM=0` to avoid kernel assertion failures (see the sketch below).

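For example, when launching a SpecForge training run (the `torchrun` invocation below is illustrative; only the environment variable comes from our setup):

```bash
# Required during training: JIT deep_gemm kernels otherwise hit assertion failures
export SGLANG_ENABLE_JIT_DEEPGEMM=0

# Hypothetical launch command; substitute your actual SpecForge entrypoint and args
torchrun --nproc_per_node=8 train_eagle3.py --config glm47_fp8.yaml
```
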
## License

This draft head is released under Apache 2.0. Please review the [GLM-4.7 license](https://huggingface.co/THUDM/GLM-4.7) before deploying the target model.

## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```

config.json
ADDED
@@ -0,0 +1,39 @@
{
  "architectures": [
    "LlamaForCausalLMEagle3"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151329,
  "draft_vocab_size": 32000,
  "dtype": "bfloat16",
  "eagle_aux_hidden_state_layer_ids": [
    2,
    46,
    89
  ],
  "eos_token_id": 151336,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 1,
  "num_key_value_heads": 8,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 1000000.0,
    "rope_type": "default"
  },
  "target_hidden_size": 5120,
  "tie_word_embeddings": false,
  "transformers_version": "5.3.0",
  "use_cache": true,
  "vocab_size": 151552
}

model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:770b811326f1be2f2881d47b00871d9ef724dad72dcffbdf20a574824043522f
size 1187962360