| library_name: transformers | |
| license: apache-2.0 | |
| language: | |
| - en | |
| base_model: Qwen/Qwen3-Coder-Next | |
| pipeline_tag: text-generation | |
| tags: | |
| - eagle3 | |
| - speculative-decoding | |
| - sglang | |
| - draft-model | |
| - moe | |
| - mixture-of-experts | |
| - gdn | |
| - hybrid-attention | |
| - code | |
| # EAGLE3 Draft Head — Qwen3-Coder-Next | |
| A lightweight EAGLE3 draft head for [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B MoE, 512 experts, 10 active per token, GDN+attention hybrid, 48 layers). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective. | |
| Qwen3-Coder-Next uses a hybrid layer design that interleaves standard attention layers with GDN (Gated DeltaNet, a linear-recurrence) layers. Only 12 of the 48 layers are attention layers (every 4th: 3, 7, 11, ..., 47). EAGLE3 auxiliary layers must be selected from attention layers only, because GDN layers produce recurrent hidden states that are not compatible with EAGLE3. The model code handles this automatically and selects layers 3, 23, and 47 (the first, middle, and last attention layers). | |
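The selection rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual model code: the helper names `attention_layer_indices` and `pick_aux_layers` are hypothetical, and the "middle" pick is assumed to be the lower of the two central attention layers, since that is what reproduces layer 23.

```python
def attention_layer_indices(num_layers: int = 48, period: int = 4) -> list[int]:
    """Indices of attention layers when every `period`-th layer is attention."""
    return [i for i in range(num_layers) if i % period == period - 1]


def pick_aux_layers(attn_layers: list[int]) -> list[int]:
    """First, middle, and last attention layers (hypothetical helper)."""
    # Assumption: with an even count, take the lower of the two central layers.
    middle = attn_layers[(len(attn_layers) - 1) // 2]
    return [attn_layers[0], middle, attn_layers[-1]]


attn = attention_layer_indices()
print(len(attn))              # 12
print(pick_aux_layers(attn))  # [3, 23, 47]
```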
| **Blog post**: [TODO: link after publication] | |
| ## Usage | |
| ### SGLang (GPU) | |
| Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for Qwen3-Coder-Next Eagle3 support. | |
| **B=1 server** (wide tree — optimal for single-user, real-time requests): | |
| ```bash | |
| pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python' | |
| python -m sglang.launch_server \ | |
| --model-path Qwen/Qwen3-Coder-Next \ | |
| --speculative-algorithm EAGLE3 \ | |
| --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \ | |
| --speculative-num-steps 3 \ | |
| --speculative-num-draft-tokens 8 \ | |
| --speculative-eagle-topk 4 \ | |
| --tp 4 \ | |
| --trust-remote-code \ | |
| --attention-backend triton \ | |
| --port 30000 | |
| ``` | |
| **B=32 server** (narrow tree — eliminates Terminal-Bench regression): | |
| ```bash | |
| python -m sglang.launch_server \ | |
| --model-path Qwen/Qwen3-Coder-Next \ | |
| --speculative-algorithm EAGLE3 \ | |
| --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \ | |
| --speculative-num-steps 5 \ | |
| --speculative-num-draft-tokens 6 \ | |
| --speculative-eagle-topk 1 \ | |
| --tp 4 \ | |
| --trust-remote-code \ | |
| --attention-backend triton \ | |
| --port 30002 | |
| ``` | |
| **Important**: Wide tree (topk=4) maximizes MT-Bench at B=32 (1.31x) but regresses Terminal-Bench (0.89x). Narrow tree (topk=1) eliminates the regression at the cost of lower peak speedup (1.10x MT-Bench). Use narrow tree for mixed or unknown workloads. | |
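For a launcher script that has to pick between the two tree shapes, the guidance above could be encoded roughly as follows. This is a minimal sketch: `TREE_PROFILES` and `speculative_args` are hypothetical names, the flag values are copied from the two launch commands in this section, and mapping "single user" to the wide tree is an assumption based on the recommendation above.

```python
# Speculative-decoding settings copied from the two launch commands above.
TREE_PROFILES = {
    "wide": {"num_steps": 3, "num_draft_tokens": 8, "eagle_topk": 4},
    "narrow": {"num_steps": 5, "num_draft_tokens": 6, "eagle_topk": 1},
}


def speculative_args(single_user: bool) -> list[str]:
    """Return SGLang CLI flags for the chosen tree shape.

    Assumption: single-user (B=1) traffic gets the wide tree;
    batched, mixed, or unknown workloads get the narrow tree.
    """
    profile = TREE_PROFILES["wide" if single_user else "narrow"]
    return [
        "--speculative-algorithm", "EAGLE3",
        "--speculative-num-steps", str(profile["num_steps"]),
        "--speculative-num-draft-tokens", str(profile["num_draft_tokens"]),
        "--speculative-eagle-topk", str(profile["eagle_topk"]),
    ]


print(" ".join(speculative_args(single_user=False)))
```

The returned list can be appended to the rest of the `sglang.launch_server` arguments (model path, `--tp`, backend, port), which are the same in both configurations apart from the port.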
| ### Python Client | |
| ```python | |
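| import requests | |
| # Chat request against the B=1 server started above (port 30000). | |
| response = requests.post( | |
|     "http://localhost:30000/v1/chat/completions", | |
|     json={ | |
|         "model": "default", | |
|         "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}], | |
|         "max_tokens": 512, | |
|         "temperature": 0,  # greedy decoding, the setting used in the benchmarks | |
|     }, | |
|     timeout=300, | |
| ) | |
| response.raise_for_status()  # surface HTTP errors before parsing the body | |
| print(response.json()["choices"][0]["message"]["content"]) | |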
| ``` | |
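To get a rough tok/s figure for a single request, one option is to time the call and divide the generated-token count by the elapsed time. The sketch below assumes the server returns the usual OpenAI-style `usage.completion_tokens` field; `tokens_per_second` and `timed_chat` are hypothetical helper names, and the standard library is used instead of `requests` only to keep the example dependency-free.

```python
import json
import time
import urllib.request


def tokens_per_second(usage: dict, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return usage["completion_tokens"] / elapsed_s


def timed_chat(prompt: str, url: str = "http://localhost:30000/v1/chat/completions") -> float:
    """Send one greedy chat request and return its end-to-end tok/s."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # Assumption: the response carries an OpenAI-style "usage" object.
    return tokens_per_second(body["usage"], elapsed)
```

A figure measured this way includes prompt processing and HTTP overhead, so it is not directly comparable to the benchmark numbers reported under Performance.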
| ## Training Details | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend | | |
| | Hardware | 8x NVIDIA H200 141GB (TP=4, DP=2) | |
| | Pre-training | 6 epochs on 54K mixed data (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 | | |
| | Optimizer | AdamW | | |
| | Batch size | 1 (per device) | | |
| | max_length | 2048 | | |
| | TTT length (training-time test steps) | 7 | |
| | Precision | bfloat16 | | |
| | Training accuracy (acc_0) | 0.97 | | |
| ### Training Method | |
| EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 3, 23, and 47: the first, middle, and last of its 12 attention layers). The training objective is the Training-Time Test (TTT) loss, which unrolls the draft head over several steps during training and feeds its own predictions back as input, as happens at inference. Training under these test-time conditions is intended to maximize the expected number of accepted tokens. | |
| GDN (linear recurrence) layers are excluded from auxiliary layer selection because their hidden states encode sequential recurrence rather than per-token representations, making them incompatible with EAGLE3's draft prediction. | |
| ## Performance | |
| ### B=1 Inference Benchmarks (temp=0, TP=4, Triton backend) | |
| | Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length | | |
| |---------|-----------------|----------------|---------|-------------|---------------| | |
| | SWEBench-Verified | 163.9 | 249.7 | **1.52x** | 37.5% | 3.00 | | |
| | HumanEval | 171.1 | 237.9 | **1.39x** | 20.0% | 1.60 | | |
| | Terminal-Bench | 166.0 | 231.0 | **1.39x** | 34.7% | 2.77 | | |
| | MT-Bench | 166.5 | 196.0 | **1.18x** | 30.6% | 2.45 | | |
| | **Mean** | **166.9** | **228.7** | **1.37x** | **30.7%** | **2.46** | | |
| ### B=32 Inference Benchmarks (temp=0, TP=4, wide tree) | |
| | Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | | |
| |---------|-----------------|----------------|---------| | |
| | MT-Bench | 1,529.1 | 2,009.4 | **1.31x** | | |
| | SWEBench-Verified | 2,010.4 | 2,186.5 | **1.09x** | | |
| | HumanEval | 1,740.2 | 1,793.8 | **1.03x** | | |
| | Terminal-Bench | 2,310.5 | 2,057.1 | 0.89x | | |
| | **Mean** | **1,897.5** | **2,011.7** | **1.06x** | | |
| ### B=32 Inference Benchmarks (temp=0, TP=4, narrow tree) | |
| | Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | | |
| |---------|-----------------|----------------|---------| | |
| | MT-Bench | 1,529.1 | 1,688.6 | **1.10x** | | |
| | Terminal-Bench | 2,310.5 | 2,379.8 | **1.03x** | | |
| | HumanEval | 1,740.2 | 1,756.3 | **1.01x** | | |
| | SWEBench-Verified | 2,010.4 | 1,998.7 | 0.99x | |
| | **Mean** | **1,897.5** | **1,955.9** | **1.03x** | | |
| *Config: B=1 uses steps=3, topk=4, draft_tokens=8. B=32 narrow uses steps=5, topk=1, draft_tokens=6. Hardware: 4x H200 (TP=4), Triton backend. SGLang commit `63291f7f51`.* | |
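The Speedup column is the EAGLE3 throughput divided by the baseline throughput. The short check below recomputes it for the B=1 table; the throughput values are copied from that table, and computing the mean speedup as the ratio of mean throughputs is an assumption that happens to reproduce the reported 1.37x.

```python
# (baseline tok/s, EAGLE3 tok/s) from the B=1 table above.
B1_RESULTS = {
    "SWEBench-Verified": (163.9, 249.7),
    "HumanEval": (171.1, 237.9),
    "Terminal-Bench": (166.0, 231.0),
    "MT-Bench": (166.5, 196.0),
}


def speedup(baseline: float, eagle3: float) -> float:
    """Throughput ratio: values above 1.0 mean EAGLE3 is faster."""
    return eagle3 / baseline


for name, (base, fast) in B1_RESULTS.items():
    print(f"{name}: {speedup(base, fast):.2f}x")

# Assumption: the mean row is the ratio of the mean throughputs.
mean_base = sum(b for b, _ in B1_RESULTS.values()) / len(B1_RESULTS)
mean_fast = sum(f for _, f in B1_RESULTS.values()) / len(B1_RESULTS)
print(f"Mean: {speedup(mean_base, mean_fast):.2f}x")
```

The same function applied to the B=32 rows shows at a glance which datasets fall below 1.0x for a given tree shape.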
| ## Model Architecture | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | Architecture | LlamaForCausalLMEagle3 | | |
| | Hidden size | 2048 | | |
| | Num hidden layers | 1 | | |
| | Num attention heads | 16 (4 KV heads) | | |
| | head_dim | 128 | | |
| | Intermediate size | 8192 | | |
| | Auxiliary layers | [3, 23, 47] (attention layers only) | | |
| | Vocab size | 151936 (target) / 32000 (draft) | | |
| | Checkpoint size | ~278 MB | | |
| ## Limitations | |
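| - **TP=4 required.** TP=8 violates the FP8 block constraint: the shared_expert dimension of 512 shards to 512/8 = 64 per rank, which is not divisible by block_n=128. At TP=4 each rank holds 512/4 = 128, which is. | |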
| - **Triton attention backend required.** FlashInfer is incompatible with the target model's hybrid attention+GDN layers (head_dim=256; the draft head itself uses head_dim=128). Pass `--attention-backend triton`. | |
| - **GDN layer constraint.** EAGLE3 auxiliary layers must be attention layers (every 4th), not GDN layers. The model code handles this automatically. | |
| - **Temperature sensitivity.** Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. | |
| - **Terminal-Bench regression at B=32.** Wide tree (topk=4) regresses Terminal-Bench to 0.89x. Use narrow tree (topk=1) for mixed workloads. | |
| - **Requires SGLang fork.** Upstream SGLang does not yet include the Qwen3-Next EAGLE3 patches. | |
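The tensor-parallel limitation in the list above comes down to a divisibility check. The sketch below reads the `512/8` in that bullet as a TP=8 shard, which is an interpretation rather than something the card states outright; `shard_fits_fp8_blocks` is a hypothetical helper, and it checks only this one constraint, not memory or other requirements.

```python
SHARED_EXPERT_DIM = 512  # shared_expert dimension quoted in the limitation
FP8_BLOCK_N = 128        # FP8 block size (block_n) quoted in the limitation


def shard_fits_fp8_blocks(tp_size: int) -> bool:
    """True if each tensor-parallel rank's shard is a whole number of FP8 blocks."""
    if SHARED_EXPERT_DIM % tp_size != 0:
        return False
    return (SHARED_EXPERT_DIM // tp_size) % FP8_BLOCK_N == 0


for tp in (4, 8):
    print(tp, SHARED_EXPERT_DIM // tp, shard_fits_fp8_blocks(tp))
# 4 128 True
# 8 64 False
```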
| ## License | |
| This draft head is released under Apache 2.0, matching the [Qwen3-Coder-Next license](https://huggingface.co/Qwen/Qwen3-Coder-Next). | |
| ## Citation | |
| ```bibtex | |
| @inproceedings{li2025eagle3, | |
| title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test}, | |
| author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang}, | |
| booktitle={Advances in Neural Information Processing Systems (NeurIPS)}, | |
| year={2025} | |
| } | |
| ``` | |