Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (arXiv:2503.01840)
An EAGLE-3 speculative-decoding draft head trained for Qwen/Qwen2.5-0.5B using the speculators library.
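EAGLE-3 attaches a lightweight draft decoder layer to the target model. The draft layer reuses hidden states from the target model to propose several candidate tokens cheaply, and the target model then verifies those candidates in a single forward pass. Only candidates consistent with the target model's own predictions are kept, so throughput improves without changing output quality.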
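The draft-then-verify loop can be illustrated with a minimal sketch of one greedy verification step. This is not the speculators or vLLM implementation: `verify_greedy` and `toy_target` are hypothetical names, and the target is called position by position for readability, whereas a real engine scores all draft positions in one forward pass.

```python
from typing import Callable, List


def verify_greedy(
    target_next: Callable[[List[int]], int],
    prefix: List[int],
    draft: List[int],
) -> List[int]:
    """Return the tokens emitted by one greedy verification step.

    target_next is a hypothetical stand-in for the target model's greedy
    next-token choice given a context.
    """
    emitted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(token)
        context.append(token)
    # Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: List[int]) -> int:
    """Toy target model: always predicts the last token plus one."""
    return context[-1] + 1


partial = verify_greedy(toy_target, [1, 2, 3], [4, 5, 9, 7])
full = verify_greedy(toy_target, [1, 2, 3], [4, 5, 6, 7])
```

With 4 draft tokens, a step emits between 1 and 5 tokens, and every emitted token is one the target would have chosen greedily on its own.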
| Detail | Value |
|---|---|
| Base / target model | Qwen/Qwen2.5-0.5B |
| Draft architecture | 1 LLaMA-style decoder layer (~30 M trainable params) |
| TTT steps | 8 |
| Draft tokens | 4 |
| Training samples | 100 000 (random-token prompts + greedy completions) |
| Epochs | 5 |
| Learning rate | 5 × 10⁻⁵ (cosine schedule) |
| Sequence length | 1 024 tokens |
| Framework | speculators ≥ 0.5.0 |
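For readers unfamiliar with the cosine schedule named in the table, the sketch below shows one common form of it with the peak learning rate of 5 × 10⁻⁵. It assumes no warmup and decay to zero, neither of which is stated above, and `cosine_lr` and `total_steps` are hypothetical names rather than speculators settings.

```python
import math

PEAK_LR = 5e-5  # peak learning rate from the table


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumes no warmup phase; the actual training run may differ.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at the peak, reaches half of it at the midpoint of training, and falls to zero at the final step.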
Serve the target model with this draft head for speculative decoding:
```bash
vllm serve Qwen/Qwen2.5-0.5B \
  --speculative-model BalajiAI/qwen2.5-0.5b.eagle3 \
  --num-speculative-tokens 4 \
  --dtype bfloat16
```
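Once the server is up, it can be queried like any other vLLM deployment, since speculative decoding is transparent to clients. The sketch below assumes vLLM's default address `http://localhost:8000` and its OpenAI-compatible `/v1/completions` endpoint; `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM's default host and port
TARGET_MODEL = "Qwen/Qwen2.5-0.5B"


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": TARGET_MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("The capital of France is")` returns the continuation generated by the target model. The completions endpoint is used rather than the chat endpoint because Qwen/Qwen2.5-0.5B is a base model.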
This draft head is released under the Apache 2.0 license.