---
library_name: transformers
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-Coder-Next
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
- gdn
- hybrid-attention
- code
---
# EAGLE3 Draft Head — Qwen3-Coder-Next
A lightweight EAGLE3 draft head for [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B MoE, 512 experts, 10 active per token, GDN+attention hybrid, 48 layers). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.
Qwen3-Coder-Next uses a hybrid layer design that interleaves standard multi-head attention with GDN (Gated DeltaNet, a linear-recurrence layer). Only 12 of the 48 layers are attention layers (every fourth layer: 3, 7, 11, ..., 47). EAGLE3 auxiliary layers must be selected from the attention layers only, because GDN layers produce recurrent hidden states that are not compatible with EAGLE3. The model code handles this automatically and selects layers 3, 23, and 47 (the first, middle, and last attention layers).
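The selection rule described above can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, and the "middle" rule (lower median) is an assumption chosen to reproduce the layers named in this card, not a copy of the model code.

```python
def attention_layer_indices(num_layers: int = 48, period: int = 4) -> list[int]:
    """Indices of attention layers when every `period`-th layer is attention."""
    return [i for i in range(num_layers) if i % period == period - 1]


def pick_aux_layers(attn_layers: list[int]) -> list[int]:
    """First, middle (lower median, an assumption), and last attention layer."""
    middle = attn_layers[(len(attn_layers) - 1) // 2]
    return [attn_layers[0], middle, attn_layers[-1]]


attn = attention_layer_indices()
aux = pick_aux_layers(attn)
print(len(attn), aux)
```

With the defaults this yields 12 attention layers and the auxiliary layers 3, 23, and 47.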
**Blog post**: [TODO: link after publication]
## Usage
### SGLang (GPU)
Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for Qwen3-Coder-Next Eagle3 support.
**B=1 server** (wide tree — optimal for single-user, real-time requests):
```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'
python -m sglang.launch_server \
--model-path Qwen/Qwen3-Coder-Next \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
--speculative-num-steps 3 \
--speculative-num-draft-tokens 8 \
--speculative-eagle-topk 4 \
--tp 4 \
--trust-remote-code \
--attention-backend triton \
--port 30000
```
**B=32 server** (narrow tree — eliminates Terminal-Bench regression):
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-Coder-Next \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
--speculative-num-steps 5 \
--speculative-num-draft-tokens 6 \
--speculative-eagle-topk 1 \
--tp 4 \
--trust-remote-code \
--attention-backend triton \
--port 30002
```
**Important**: Wide tree (topk=4) maximizes MT-Bench at B=32 (1.31x) but regresses Terminal-Bench (0.89x). Narrow tree (topk=1) eliminates the regression at the cost of lower peak speedup (1.10x MT-Bench). Use narrow tree for mixed or unknown workloads.
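The two configurations differ only in three speculative-decoding flags and the port. A small helper can keep them side by side; this is a sketch with hypothetical names (`PROFILES`, `launch_args`), and every flag and value is copied from the launch commands above.

```python
import shlex

# Values copied from the two launch commands above.
PROFILES = {
    "wide": {"steps": 3, "draft_tokens": 8, "topk": 4},    # B=1, single-user latency
    "narrow": {"steps": 5, "draft_tokens": 6, "topk": 1},  # B=32, mixed or unknown workloads
}


def launch_args(profile: str, port: int = 30000, tp: int = 4) -> list[str]:
    """Build the sglang.launch_server command line for a named profile."""
    p = PROFILES[profile]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen3-Coder-Next",
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "thoughtworks/Qwen3-Coder-Next-Eagle3",
        "--speculative-num-steps", str(p["steps"]),
        "--speculative-num-draft-tokens", str(p["draft_tokens"]),
        "--speculative-eagle-topk", str(p["topk"]),
        "--tp", str(tp),
        "--trust-remote-code",
        "--attention-backend", "triton",
        "--port", str(port),
    ]


# Print a copy-pasteable command for the narrow-tree server.
print(shlex.join(launch_args("narrow", port=30002)))
```

The returned list can also be passed directly to `subprocess.run` if you prefer to start the server from Python.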
### Python Client
```python
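import requests

# Query the OpenAI-compatible chat endpoint of the server on port 30000.
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,  # greedy decoding; see Limitations on temperature
    },
    timeout=300,
)
response.raise_for_status()  # surface HTTP errors instead of a KeyError below
print(response.json()["choices"][0]["message"]["content"])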
```
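To get a rough feel for throughput on your own hardware, you can time a request and divide the generated token count by the elapsed time. The sketch below uses only the standard library and assumes the response carries a `usage.completion_tokens` field, as OpenAI-compatible servers usually do. The function names are hypothetical, and the measurement includes prefill and HTTP overhead, so it will not reproduce the benchmark numbers in this card.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(prompt: str, url: str = "http://localhost:30000/v1/chat/completions") -> float:
    """Send one greedy request and return end-to-end tokens per second."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # Assumption: the server reports token usage in the OpenAI response format.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Calling `measure("Write a Python function to merge two sorted lists.")` against a baseline server and an EAGLE3 server gives two numbers whose ratio is a coarse speedup estimate.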
## Training Details
| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=4, DP=2) |
| Pre-training | 6 epochs on 54K mixed data (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT (training-time test) length | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |
### Training Method
EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 3, 23, and 47: the first, middle, and last of the 12 attention layers). The training objective is the Training-Time Test (TTT) loss, which simulates multi-step drafting during training by feeding the draft head's own predictions back in as inputs. The head therefore learns to stay accurate over the steps it runs at inference time, which maximizes the expected number of accepted tokens.
GDN (linear recurrence) layers are excluded from auxiliary layer selection because their hidden states encode sequential recurrence rather than per-token representations, making them incompatible with EAGLE3's draft prediction.
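The quantity the draft head is trained to improve, accepted tokens per verification step, is easiest to see for greedy decoding on a single chain of draft tokens. The sketch below is a simplified illustration with a hypothetical function name; tree drafts (topk > 1) verify several branches at once, which is omitted here, and this is not the SGLang implementation.

```python
def verify_greedy(draft: list[int], target: list[int]) -> list[int]:
    """Tokens emitted by one greedy verification step over a draft chain.

    draft:  tokens proposed by the draft head, in order.
    target: the target model's greedy token at each position, obtained from
            one forward pass over the draft; len(target) == len(draft) + 1.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have one more entry than draft")
    accepted = 0
    # Accept draft tokens for as long as they match the target's own choice.
    while accepted < len(draft) and draft[accepted] == target[accepted]:
        accepted += 1
    # The target's token at the first mismatch (or after a full match) is
    # always emitted, so every step yields at least one token.
    return draft[:accepted] + [target[accepted]]


print(verify_greedy([5, 9, 2], [5, 9, 7, 4]))
```

Here the first two draft tokens match, the third does not, and the step emits `[5, 9, 7]`: two accepted draft tokens plus the target's correction.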
## Performance
### B=1 Inference Benchmarks (temp=0, TP=4, Triton backend)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---------|-----------------|----------------|---------|-------------|---------------|
| SWEBench-Verified | 163.9 | 249.7 | **1.52x** | 37.5% | 3.00 |
| HumanEval | 171.1 | 237.9 | **1.39x** | 20.0% | 1.60 |
| Terminal-Bench | 166.0 | 231.0 | **1.39x** | 34.7% | 2.77 |
| MT-Bench | 166.5 | 196.0 | **1.18x** | 30.6% | 2.45 |
| **Mean** | **166.9** | **228.7** | **1.37x** | **30.7%** | **2.46** |
### B=32 Inference Benchmarks (temp=0, TP=4, wide tree)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| MT-Bench | 1,529.1 | 2,009.4 | **1.31x** |
| SWEBench-Verified | 2,010.4 | 2,186.5 | **1.09x** |
| HumanEval | 1,740.2 | 1,793.8 | **1.03x** |
| Terminal-Bench | 2,310.5 | 2,057.1 | 0.89x |
| **Mean** | **1,897.5** | **2,011.7** | **1.06x** |
### B=32 Inference Benchmarks (temp=0, TP=4, narrow tree)
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| MT-Bench | 1,529.1 | 1,688.6 | **1.10x** |
| Terminal-Bench | 2,310.5 | 2,379.8 | **1.03x** |
| HumanEval | 1,740.2 | 1,756.3 | **1.01x** |
| SWEBench-Verified | 2,010.4 | 1,998.7 | 0.99x |
| **Mean** | **1,897.5** | **1,955.9** | **1.03x** |
*Config: B=1 uses steps=3, topk=4, draft_tokens=8. B=32 narrow uses steps=5, topk=1, draft_tokens=6. Hardware: 4x H200 (TP=4), Triton backend. SGLang commit `63291f7f51`.*
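The Speedup column is EAGLE3 throughput divided by baseline throughput. For the B=1 and B=32 wide-tree tables, the Mean row matches the ratio of the mean throughputs rather than the average of the per-dataset speedups; that is an observation from the numbers, not something stated in this card. A short check, with the figures copied from those two tables:

```python
# (baseline tok/s, EAGLE3 tok/s), copied from the tables above.
B1 = {
    "SWEBench-Verified": (163.9, 249.7),
    "HumanEval": (171.1, 237.9),
    "Terminal-Bench": (166.0, 231.0),
    "MT-Bench": (166.5, 196.0),
}
B32_WIDE = {
    "MT-Bench": (1529.1, 2009.4),
    "SWEBench-Verified": (2010.4, 2186.5),
    "HumanEval": (1740.2, 1793.8),
    "Terminal-Bench": (2310.5, 2057.1),
}


def speedup(baseline: float, eagle: float) -> float:
    """Per-dataset speedup: EAGLE3 throughput over baseline throughput."""
    return eagle / baseline


def mean_speedup(rows: dict[str, tuple[float, float]]) -> float:
    """Ratio of mean throughputs (equal to the ratio of the sums)."""
    total_baseline = sum(b for b, _ in rows.values())
    total_eagle = sum(e for _, e in rows.values())
    return total_eagle / total_baseline


for name, rows in (("B=1", B1), ("B=32 wide", B32_WIDE)):
    for dataset, (b, e) in rows.items():
        print(f"{name:10s} {dataset:18s} {speedup(b, e):.2f}x")
    print(f"{name:10s} {'Mean':18s} {mean_speedup(rows):.2f}x")
```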
## Model Architecture
| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 2048 |
| Num hidden layers | 1 |
| Num attention heads | 16 (4 KV heads) |
| head_dim | 128 |
| Intermediate size | 8192 |
| Auxiliary layers | [3, 23, 47] (attention layers only) |
| Vocab size | 151936 (target) / 32000 (draft) |
| Checkpoint size | ~278 MB |
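As a plausibility check on the checkpoint size, the weight count can be estimated from the table. The layout below is an assumption based on the usual EAGLE-3 draft design, not read from this checkpoint: a bias-free projection fuses the three auxiliary hidden states, the attention q/k/v projections take the token embedding concatenated with the fused state (2x hidden), the token embedding is reused from the target model and not stored, and norm weights and vocabulary-mapping buffers are ignored.

```python
# Values from the Model Architecture table.
hidden, intermediate = 2048, 8192
heads, kv_heads, head_dim = 16, 4, 128
draft_vocab, num_aux = 32000, 3

# Assumed layer shapes (see the caveats in the text above).
fuse = num_aux * hidden * hidden          # concat of 3 aux states -> hidden
q_proj = 2 * hidden * heads * head_dim    # input is [embedding ; fused state]
k_proj = 2 * hidden * kv_heads * head_dim
v_proj = 2 * hidden * kv_heads * head_dim
o_proj = heads * head_dim * hidden
mlp = 3 * hidden * intermediate           # gate, up, and down projections
lm_head = hidden * draft_vocab            # draft vocabulary only

params = fuse + q_proj + k_proj + v_proj + o_proj + mlp + lm_head
size_mib = params * 2 / 2**20             # bfloat16 = 2 bytes per weight

print(f"{params / 1e6:.1f}M parameters, about {size_mib:.0f} MiB in bfloat16")
```

Under these assumptions the estimate lands within a few megabytes of the ~278 MB listed in the table, which suggests the checkpoint holds the draft head alone without a copy of the target's 151936-entry embedding.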
## Limitations
- **TP=4 required.** FP8 block constraint: the shared_expert dimension is 512, and at TP=8 each shard is 512/8=64, which is not divisible by block_n=128. At TP=4 each shard is 512/4=128.
- **Triton attention backend required.** FlashInfer is incompatible with the target model's head_dim=256 hybrid attention+GDN layers. Pass `--attention-backend triton`.
- **GDN layer constraint.** EAGLE3 auxiliary layers must be attention layers (every 4th), not GDN layers. The model code handles this automatically.
- **Temperature sensitivity.** Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates.
- **Terminal-Bench regression at B=32.** Wide tree (topk=4) regresses Terminal-Bench to 0.89x. Use narrow tree (topk=1) for mixed workloads.
- **Requires SGLang fork.** Upstream SGLang does not yet include the Qwen3-Next EAGLE3 patches.
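The tensor-parallel constraint in the first bullet comes down to a divisibility check. The sketch below reads the 512/8 in that bullet as the per-rank shard at TP=8, which is an interpretation rather than something the card spells out. The function name is hypothetical and this is not SGLang's actual validation code.

```python
def shard_ok(dim: int, tp: int, block_n: int = 128) -> bool:
    """True if `dim` splits evenly across `tp` ranks into block-aligned shards."""
    return dim % tp == 0 and (dim // tp) % block_n == 0


# shared_expert dim=512 and block_n=128 are taken from the bullet above.
for tp in (1, 2, 4, 8):
    print(f"TP={tp}: shard={512 // tp}, ok={shard_ok(512, tp)}")
```

The check only rules out TP=8; it does not by itself explain why smaller TP values are not recommended.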
## License
This draft head is released under Apache 2.0, matching the [Qwen3-Coder-Next license](https://huggingface.co/Qwen/Qwen3-Coder-Next).
## Citation
```bibtex
@inproceedings{li2025eagle3,
title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}
```