Qwen2.5-14B-Instruct_EAGLE3_UltraChat
This repository contains the EAGLE-3 draft model presented in the paper Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs.
Code: GitHub - FailFast
Introduction
Qwen2.5-14B-Instruct_EAGLE3_UltraChat is an EAGLE-3 draft model trained with the SpecForge framework against the open-source Qwen2.5-14B-Instruct target model. It is used with the EAGLE-3 speculative decoding algorithm to speed up the decoding stage of large language model inference.
Training Configuration
We adopted the default training hyperparameters in SpecForge and trained EAGLE-3 to match the target model's output until convergence.
This model checkpoint was obtained after five epochs of training ($\sim$260k training steps with a batch size of 4). We find that although further training improves training-time accuracy, it has a negligible impact on the end-to-end speedup of EAGLE-3.
- Dataset: UltraChat-200K.
- Training environment: 4 NVIDIA H100 GPUs with 80 GB VRAM each, using the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
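As a rough sanity check on the figures above, the step count, batch size, and epoch count together imply how many samples are seen per epoch and how long the full run took. This is a back-of-the-envelope sketch: it assumes that bs=4 is the global batch size and that each step consumes one batch, neither of which is stated here, and the variable names are illustrative.

```python
# Figures quoted in this section (both "~" values are approximate).
epochs = 5
total_steps = 260_000      # "~260k training steps"
batch_size = 4             # assumption: global batch size, one batch per step
hours_per_epoch = 3.5      # "approximately 3.5 hours"

steps_per_epoch = total_steps // epochs
samples_per_epoch = steps_per_epoch * batch_size
total_hours = epochs * hours_per_epoch

print(f"steps per epoch:   ~{steps_per_epoch:,}")
print(f"samples per epoch: ~{samples_per_epoch:,}")
print(f"total training:    ~{total_hours} h")
```

Under these assumptions, the implied per-epoch sample count is on the order of 200k, which is the scale the dataset name suggests.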
Model Inference Launch Command
To launch an EAGLE-3 speculative decoding service with vLLM, run:
vllm serve Qwen/Qwen2.5-14B-Instruct \
--dtype auto -tp 2 --max_model_len 2048 \
--gpu-memory-utilization 0.8 --port 30000 \
--speculative_config '{"model": "/PATH/TO/EAGLE/WEIGHTS", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
To launch vanilla decoding, our performance baseline, run:
vllm serve Qwen/Qwen2.5-14B-Instruct \
--dtype auto -tp 2 --max_model_len 2048 \
--gpu-memory-utilization 0.8 --port 30000
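When trying several speculation lengths, it can help to generate the EAGLE-3 launch command programmatically so that the JSON passed to `--speculative_config` stays well-formed. The sketch below only builds and prints the command lines; it does not start a server. The helper name `build_eagle3_command` is hypothetical, and the flags and values are copied from the EAGLE-3 command above.

```python
import json
import shlex


def build_eagle3_command(draft_path, num_speculative_tokens):
    """Return the `vllm serve` argument list for one speculation length."""
    speculative_config = {
        "model": draft_path,
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
    }
    return [
        "vllm", "serve", "Qwen/Qwen2.5-14B-Instruct",
        "--dtype", "auto", "-tp", "2", "--max_model_len", "2048",
        "--gpu-memory-utilization", "0.8", "--port", "30000",
        # json.dumps guarantees valid JSON; shlex.join quotes it for the shell.
        "--speculative_config", json.dumps(speculative_config),
    ]


# One command per speculation length from 3 to 20 (inclusive).
commands = [
    build_eagle3_command("/PATH/TO/EAGLE/WEIGHTS", k) for k in range(3, 21)
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell as-is after replacing `/PATH/TO/EAGLE/WEIGHTS` with the local path to this checkpoint.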
Performance Evaluation
We ran our evaluations on two NVIDIA A6000 48 GB GPUs connected via PCIe 4.0 x16. We swept num_speculative_tokens from 3 to 20, and each entry reports the best speedup across these speculation lengths. The following table reports the TPT speedup over vanilla decoding.
| Target Model | MATH | AIME | GSM8K | GPQA | HumanEval | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 2.51x | 2.45x | 2.27x | 2.03x | 2.68x | 2.39x |
| Qwen2.5-14B-Instruct | 2.33x | 2.23x | 2.19x | 1.98x | 2.61x | 2.27x |
| Qwen2.5-7B-Instruct | 2.19x | 2.05x | 2.02x | 1.78x | 2.25x | 2.06x |
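The Average column appears to be the arithmetic mean of the five per-benchmark speedups, rounded to two decimals. The following sketch recomputes it from the values in the table under that assumption; the dictionary names are illustrative.

```python
# Per-benchmark speedups copied from the table, in column order:
# MATH, AIME, GSM8K, GPQA, HumanEval.
speedups = {
    "Qwen2.5-32B-Instruct": [2.51, 2.45, 2.27, 2.03, 2.68],
    "Qwen2.5-14B-Instruct": [2.33, 2.23, 2.19, 1.98, 2.61],
    "Qwen2.5-7B-Instruct": [2.19, 2.05, 2.02, 1.78, 2.25],
}

# Average column as printed in the table.
reported_average = {
    "Qwen2.5-32B-Instruct": "2.39",
    "Qwen2.5-14B-Instruct": "2.27",
    "Qwen2.5-7B-Instruct": "2.06",
}

# Arithmetic mean per model, formatted to two decimals like the table.
recomputed = {
    model: f"{sum(values) / len(values):.2f}"
    for model, values in speedups.items()
}

for model, avg in recomputed.items():
    status = "matches" if avg == reported_average[model] else "differs from"
    print(f"{model}: {avg}x ({status} the table)")
```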
Relevant Links
- Qwen2.5-14B-Instruct Open-source Weights: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct
- "Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs" [arXiv '25]: https://arxiv.org/pdf/2512.20573
- Artifact of FailFast: https://github.com/ruipeterpan/failfast