EAGLE-3 Draft Head for Qwen/Qwen2.5-0.5B

An EAGLE-3 speculative-decoding draft head trained for Qwen/Qwen2.5-0.5B using the speculators library.

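EAGLE-3 attaches a lightweight draft head (here a single decoder layer) that reads hidden states fused from several layers of the target model and proposes a short run of candidate tokens. The target model then checks all candidates in a single forward pass and keeps only those consistent with its own predictions, so the output distribution is the same as the target model's alone. The throughput gain depends on how many drafted tokens are accepted per pass.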
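A minimal sketch of the draft-then-verify loop, simplified to greedy decoding with exact-match acceptance (under sampling, serving engines use rejection sampling instead, which this does not show). The `NextToken` callables, `greedy_decode`, and `speculative_greedy_decode` are hypothetical stand-ins: a real EAGLE-3 head also consumes the target's hidden states, and a serving engine verifies all drafted positions in one batched forward pass rather than one call per position. What it illustrates is that the result equals plain greedy decoding with the target regardless of what the draft proposes; a better draft only reduces the number of verification rounds.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
# A real EAGLE-3 draft head would also take the target model's hidden states.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding: one target call per new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Draft-then-verify decoding with greedy (exact-match) acceptance."""
    tokens = list(prompt)
    end = len(prompt) + max_new_tokens
    while len(tokens) < end:
        # Draft: propose k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # Verify: an engine scores all k + 1 positions in one target pass;
        # this sketch calls `target` once per position for clarity.
        accepted: List[int] = []
        for i in range(num_draft_tokens + 1):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # matching draft, correction, or bonus token
            if i == num_draft_tokens or proposal[i] != expected:
                break
        # Every round emits at least one token, so the loop terminates.
        tokens.extend(accepted)
    return tokens[:end]


if __name__ == "__main__":
    def target(seq: List[int]) -> int:
        return (3 * sum(seq) + 1) % 11

    def draft(seq: List[int]) -> int:
        # Hypothetical weak drafter: right on even-length prefixes, often wrong otherwise.
        return target(seq) if len(seq) % 2 == 0 else 0

    prompt = [4, 2]
    assert speculative_greedy_decode(target, draft, prompt, 12) == greedy_decode(target, prompt, 12)
    print("speculative output matches greedy target output")
```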

Training Details

Detail	Value
Base / target model	`Qwen/Qwen2.5-0.5B`
Draft architecture	1 LLaMA-style decoder layer (~30 M trainable params)
TTT (training-time test) steps	8
Draft tokens	4
Training samples	100 000 (random-token prompts + greedy completions)
Epochs	5
Learning rate	5 × 10⁻⁵ (cosine schedule)
Sequence length	1,024 tokens
Framework	speculators ≥ 0.5.0
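The table gives a cosine schedule peaking at 5 × 10⁻⁵ but does not state the batch size, warmup length, or final learning rate. A minimal sketch of the standard cosine-decay formula, η(t) = η_min + ½(η_peak − η_min)(1 + cos(π t / T)), assuming no warmup and decay to zero by default. The `cosine_lr` function, its optional warmup, and `batch_size = 8` are hypothetical illustrations, not the speculators library's actual scheduler or this run's settings.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-5,
    warmup_steps: int = 0,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step`: optional linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr (assumed, not from the card).
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical batch size: the card does not state one.
batch_size = 8
total_steps = math.ceil(100_000 / batch_size) * 5  # 100k samples x 5 epochs

if __name__ == "__main__":
    for step in (0, total_steps // 2, total_steps):
        print(step, cosine_lr(step, total_steps))
```

With these assumptions the rate starts at the peak, reaches half the peak at the midpoint, and decays to zero at the final step.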

Usage with vLLM

Serve the target model with this draft head for speculative decoding:

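```shell
vllm serve Qwen/Qwen2.5-0.5B \
  --dtype bfloat16 \
  --speculative-config '{"method": "eagle3", "model": "BalajiAI/qwen2.5-0.5b.eagle3", "num_speculative_tokens": 4}'
```

Older vLLM releases took separate `--speculative-model` and `--num-speculative-tokens` flags; current releases take a single JSON `--speculative-config` instead.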
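How much the number of speculative tokens (4 here) buys depends on the acceptance rate. Under the common simplifying assumption that each drafted token is accepted independently with probability α, one verification pass over k drafted tokens emits on average (1 − α^(k+1)) / (1 − α) tokens, counting the one token the target model always contributes. A sketch of that estimate; `expected_tokens_per_step` is a hypothetical helper, α is an input you would measure rather than a figure reported for this draft head, and the result is tokens per target pass, not wall-clock speedup, since it ignores the draft head's own cost.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability `alpha` (a simplification: real acceptance varies with
    position and context). Equals sum(alpha**i for i in range(k + 1)).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft accepted, plus the bonus token from the target.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 4):.2f} tokens per target pass")
```

The estimate saturates at k + 1, which is why raising the number of speculative tokens helps little once the acceptance rate is low.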

License

This draft head is released under the Apache 2.0 license.

Paper

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (arXiv:2503.01840, published Mar 3, 2025)