EAGLE3 Draft Head — Qwen3-Coder-Next

A lightweight EAGLE3 draft head for Qwen3-Coder-Next (80B MoE, 512 experts, 10 active per token, GDN+attention hybrid, 48 layers). Trained with SpecForge on 8x H200 GPUs using the EAGLE-3 training-time test objective.

Qwen3-Coder-Next uses a hybrid layer design that interleaves standard multi-head attention with GDN (linear recurrence) layers. Only 12 of 48 layers are attention layers (every 4th: 3, 7, 11, ..., 47). EAGLE3 auxiliary layers must be selected from attention layers only — GDN layers produce recurrent hidden states that are not compatible with EAGLE3. The model code handles this automatically, selecting layers 3, 23, 47 (first, middle, last attention layers).
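
The selection logic is simple enough to sketch (a hypothetical helper, not the actual model code): enumerate the attention layers and take the first, middle, and last.

```python
# Sketch of the EAGLE3 auxiliary-layer selection for Qwen3-Coder-Next.
# Hypothetical helper; the shipped model code does the equivalent automatically.
def select_aux_layers(num_layers: int = 48, attn_period: int = 4) -> list[int]:
    # Attention layers sit at every 4th index: 3, 7, 11, ..., 47; the rest are GDN.
    attn_layers = [i for i in range(num_layers) if (i + 1) % attn_period == 0]
    # First, middle, and last attention layers; GDN layers are never eligible.
    return [attn_layers[0], attn_layers[(len(attn_layers) - 1) // 2], attn_layers[-1]]

assert select_aux_layers() == [3, 23, 47]
```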

Blog post: [TODO: link after publication]

Usage

SGLang (GPU)

Requires our SGLang fork for Qwen3-Coder-Next Eagle3 support.

B=1 server (wide tree — optimal for single-user, real-time requests):

pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 \
    --speculative-eagle-topk 4 \
    --tp 4 \
    --trust-remote-code \
    --attention-backend triton \
    --port 30000

B=32 server (narrow tree — eliminates Terminal-Bench regression):

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
    --speculative-num-steps 5 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 1 \
    --tp 4 \
    --trust-remote-code \
    --attention-backend triton \
    --port 30002

Important: Wide tree (topk=4) maximizes MT-Bench at B=32 (1.31x) but regresses Terminal-Bench (0.89x). Narrow tree (topk=1) eliminates the regression at the cost of lower peak speedup (1.10x MT-Bench). Use narrow tree for mixed or unknown workloads.
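
If you switch between the two tree shapes programmatically (for example per deployment), the flag sets above can live in a small preset table. This is a sketch; the preset names and helper are illustrative, and the flag values are exactly the two configurations shown above.

```python
# Speculative-decoding presets from the two server configs above:
# "wide" for B=1 / latency-sensitive traffic, "narrow" for batched or mixed workloads.
SPEC_PRESETS = {
    "wide":   {"speculative-num-steps": 3, "speculative-num-draft-tokens": 8, "speculative-eagle-topk": 4},
    "narrow": {"speculative-num-steps": 5, "speculative-num-draft-tokens": 6, "speculative-eagle-topk": 1},
}

def spec_args(preset: str) -> list[str]:
    """Render one preset as extra CLI arguments for sglang.launch_server."""
    return [f"--{flag}={value}" for flag, value in SPEC_PRESETS[preset].items()]

print(" ".join(spec_args("narrow")))
# --speculative-num-steps=5 --speculative-num-draft-tokens=6 --speculative-eagle-topk=1
```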

Python Client

import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    }
)
print(response.json()["choices"][0]["message"]["content"])
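
The server exposes an OpenAI-compatible chat API, so the official `openai` client works as well. A minimal sketch, assuming a local server on port 30000 (any placeholder API key is accepted):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    max_tokens=512,
    temperature=0,  # greedy decoding gives the best draft acceptance rates
)
print(response.choices[0].message.content)
```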

Training Details

| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=4, DP=2) |
| Pre-training | 6 epochs on 54K mixed samples (ShareGPT / UltraChat / PerfectBlend), LR = 1e-4 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT length (training-time test steps) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |

Training Method

EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 3, 23, 47 — first, middle, and last attention layers out of 12 total). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.

GDN (linear recurrence) layers are excluded from auxiliary layer selection because their hidden states encode sequential recurrence rather than per-token representations, making them incompatible with EAGLE3's draft prediction.
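
As a rough sketch of the data flow only (module names, the fusion projection, and the generic PyTorch layer are illustrative, not the SpecForge implementation): the three auxiliary hidden states are concatenated, fused down to the draft hidden size, combined with the token embeddings, and passed through the single draft decoder layer before the draft LM head.

```python
import torch
import torch.nn as nn

class Eagle3DraftHeadSketch(nn.Module):
    """Illustrative EAGLE3-style draft head; dimensions follow the Model Architecture table."""

    def __init__(self, hidden=2048, draft_vocab=32000, num_aux=3):
        super().__init__()
        self.fuse = nn.Linear(num_aux * hidden, hidden)    # fuse features from layers [3, 23, 47]
        self.combine = nn.Linear(2 * hidden, hidden)       # mix fused features with token embeddings
        self.layer = nn.TransformerEncoderLayer(hidden, nhead=16, dim_feedforward=8192,
                                                batch_first=True, norm_first=True)
        self.lm_head = nn.Linear(hidden, draft_vocab)      # reduced draft vocabulary (32000)

    def forward(self, aux_states: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # aux_states:   [batch, seq, 3 * hidden] hidden states captured from the target model
        # token_embeds: [batch, seq, hidden]     embeddings of the current draft sequence
        x = self.combine(torch.cat([self.fuse(aux_states), token_embeds], dim=-1))
        x = self.layer(x)                                  # single decoder layer (causal mask omitted here)
        return self.lm_head(x)                             # draft logits; mapped back to target-vocab ids at decode time
```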

Performance

B=1 Inference Benchmarks (temp=0, TP=4, Triton backend)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| SWEBench-Verified | 163.9 | 249.7 | 1.52x | 37.5% | 3.00 |
| HumanEval | 171.1 | 237.9 | 1.39x | 20.0% | 1.60 |
| Terminal-Bench | 166.0 | 231.0 | 1.39x | 34.7% | 2.77 |
| MT-Bench | 166.5 | 196.0 | 1.18x | 30.6% | 2.45 |
| Mean | 166.9 | 228.7 | 1.37x | 30.7% | 2.46 |

B=32 Inference Benchmarks (temp=0, TP=4, wide tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 1,529.1 | 2,009.4 | 1.31x |
| SWEBench-Verified | 2,010.4 | 2,186.5 | 1.09x |
| HumanEval | 1,740.2 | 1,793.8 | 1.03x |
| Terminal-Bench | 2,310.5 | 2,057.1 | 0.89x |
| Mean | 1,897.5 | 2,011.7 | 1.06x |

B=32 Inference Benchmarks (temp=0, TP=4, narrow tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 1,529.1 | 1,688.6 | 1.10x |
| Terminal-Bench | 2,310.5 | 2,379.8 | 1.03x |
| HumanEval | 1,740.2 | 1,756.3 | 1.01x |
| SWEBench-Verified | 2,010.4 | 1,998.7 | 1.00x |
| Mean | 1,897.5 | 1,955.9 | 1.03x |

Config: B=1 uses steps=3, topk=4, draft_tokens=8. B=32 narrow uses steps=5, topk=1, draft_tokens=6. Hardware: 4x H200 (TP=4), Triton backend. SGLang commit 63291f7f51.
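
To sanity-check throughput on your own hardware, a rough client-side measurement is enough. This sketch assumes the OpenAI-compatible endpoint returns a usage block with completion_tokens and simply divides by wall-clock request time (so prefill is included; treat it as a lower bound):

```python
import time
import requests

def measure_tok_per_s(prompt: str, port: int = 30000, max_tokens: int = 512) -> float:
    """Rough decode throughput: completion tokens divided by wall-clock request time."""
    start = time.perf_counter()
    response = requests.post(
        f"http://localhost:{port}/v1/chat/completions",
        json={
            "model": "default",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0,
        },
    )
    elapsed = time.perf_counter() - start
    return response.json()["usage"]["completion_tokens"] / elapsed

print(f"{measure_tok_per_s('Write a Python function to merge two sorted lists.'):.1f} tok/s")
```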

Model Architecture

| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 2048 |
| Num hidden layers | 1 |
| Num attention heads | 16 (4 KV heads) |
| head_dim | 128 |
| Intermediate size | 8192 |
| Auxiliary layers | [3, 23, 47] (attention layers only) |
| Vocab size | 151936 (target) / 32000 (draft) |
| Checkpoint size | ~278 MB |

Limitations

  • TP=4 required. FP8 block constraint: at TP=8 the shared_expert dim of 512 splits to 512/8 = 64 per rank, which is not divisible by block_n=128; TP=4 keeps it at 128 per rank (see the sketch after this list).
  • Triton attention backend required. FlashInfer is incompatible with head_dim=256 hybrid attention+GDN layers. Pass --attention-backend triton.
  • GDN layer constraint. EAGLE3 auxiliary layers must be attention layers (every 4th), not GDN layers. The model code handles this automatically.
  • Temperature sensitivity. Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates.
  • Terminal-Bench regression at B=32. Wide tree (topk=4) regresses Terminal-Bench to 0.89x. Use narrow tree (topk=1) for mixed workloads.
  • Requires SGLang fork. Upstream SGLang does not yet include the Qwen3-Next EAGLE3 patches.
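
The first constraint is only divisibility arithmetic, sketched here with the numbers from the bullet above (dimension 512, FP8 block size 128):

```python
# FP8 block-quantization constraint on the shared expert: the per-rank slice of the
# 512-wide dimension must be divisible by the quantization block size block_n=128.
SHARED_EXPERT_DIM = 512
BLOCK_N = 128

for tp in (1, 2, 4, 8):
    per_rank = SHARED_EXPERT_DIM // tp
    status = "ok" if per_rank % BLOCK_N == 0 else f"not divisible by {BLOCK_N}"
    print(f"TP={tp}: {per_rank} per rank -> {status}")
# TP=8 leaves 64 per rank and violates the constraint, which is why this card pins TP=4.
```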

License

This draft head is released under Apache 2.0, matching the Qwen3-Coder-Next license.

Citation

@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}