Qwen2.5-14B-Instruct_EAGLE3_UltraChat

This repository contains the EAGLE-3 draft model presented in the paper Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs.

Code: GitHub - FailFast

Introduction

Qwen2.5-14B-Instruct_EAGLE3_UltraChat is an EAGLE-3 draft model trained for the open-source Qwen2.5-14B-Instruct target model using the SpecForge framework. It is intended for the EAGLE-3 speculative decoding algorithm, which speeds up the decoding stage of large language model inference.

Training Configuration

We adopted the default training hyperparameters in SpecForge and trained the EAGLE-3 draft model to match the target model's output until convergence.

This model checkpoint was obtained after five epochs of training ($\sim$260k training steps with bs=4). We find that although further training improves training-time accuracy, it has a negligible impact on the end-to-end speedup of EAGLE-3.

  • Dataset: UltraChat-200K.
  • Training environment: 4 NVIDIA H100 GPUs with 80 GB of VRAM each, using the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
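
As a rough consistency check on the figures above, the step count and training time can be recomputed from the dataset size. This is only a sketch: the example count of 207,865 is an assumption (the size of the `train_sft` split of UltraChat-200K on the Hugging Face Hub; the card does not say which split was used), and bs=4 is read here as the global batch size.

```python
# Rough consistency check for the training figures quoted in this card.
# ASSUMPTION: 207,865 examples (the `train_sft` split of UltraChat-200K);
# the card does not state which split was used.
NUM_EXAMPLES = 207_865
EPOCHS = 5              # from the card
BATCH_SIZE = 4          # "bs=4" in the card, ASSUMED to be the global batch size
HOURS_PER_EPOCH = 3.5   # from the card
NUM_GPUS = 4            # from the card

steps_per_epoch = NUM_EXAMPLES / BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
wall_clock_hours = EPOCHS * HOURS_PER_EPOCH
gpu_hours = wall_clock_hours * NUM_GPUS

print(f"total steps ~ {total_steps:,.0f}")
print(f"wall clock  ~ {wall_clock_hours} h")
print(f"GPU-hours   ~ {gpu_hours} h")
```

Under these assumptions the step count lands close to the quoted $\sim$260k, and five epochs correspond to roughly 17.5 hours of wall-clock time on the 4-GPU node.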

Model Inference Launch Command

To serve the target model with EAGLE-3 speculative decoding using vLLM, run:

vllm serve Qwen/Qwen2.5-14B-Instruct \
  --dtype auto -tp 2 --max_model_len 2048 \
  --gpu-memory-utilization 0.8 --port 30000 \
  --speculative_config '{"model": "/PATH/TO/EAGLE/WEIGHTS", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'

To serve the target model with vanilla decoding, our performance baseline, run:

vllm serve Qwen/Qwen2.5-14B-Instruct \
  --dtype auto -tp 2 --max_model_len 2048 \
  --gpu-memory-utilization 0.8 --port 30000
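
Once either server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch, not the evaluation harness used for this card: the host (`localhost`), the helper names `build_payload`, `seconds_per_token`, and `measure`, the greedy sampling setting, and the 256-token limit are all illustrative assumptions. The timing is crude because it includes prompt processing.

```python
import json
import time
import urllib.request

# Port 30000 comes from the launch commands; the host is ASSUMED to be local.
BASE_URL = "http://localhost:30000"
MODEL = "Qwen/Qwen2.5-14B-Instruct"


def build_payload(prompt, max_tokens=256):
    """Request body for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding (an illustrative choice)
    }


def seconds_per_token(elapsed_s, completion_tokens):
    """Average wall-clock time per generated token for one request."""
    if completion_tokens <= 0:
        raise ValueError("no tokens were generated")
    return elapsed_s / completion_tokens


def measure(prompt):
    """Send one request and return its time per output token.

    Requires a running server; nothing is sent until this is called.
    """
    request = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return seconds_per_token(elapsed, body["usage"]["completion_tokens"])
```

Calling `measure(...)` with the same prompt against the vanilla server and then the EAGLE-3 server, and dividing the first result by the second, gives a rough per-request speedup.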

Performance Evaluation

We ran our evaluations on two NVIDIA A6000-48GB GPUs connected via PCIe 4.0 x16. We swept num_speculative_tokens from 3 to 20, and each table entry reports the best speedup across those speculation lengths. The following table reports the TPT speedup over vanilla decoding.

| Target Model | MATH | AIME | GSM8K | GPQA | HumanEval | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 2.51x | 2.45x | 2.27x | 2.03x | 2.68x | 2.39x |
| Qwen2.5-14B-Instruct | 2.33x | 2.23x | 2.19x | 1.98x | 2.61x | 2.27x |
| Qwen2.5-7B-Instruct | 2.19x | 2.05x | 2.02x | 1.78x | 2.25x | 2.06x |
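
The reporting procedure described above (take the best speedup over a sweep of speculation lengths, then average across benchmarks) can be sketched as follows. The per-benchmark speedups are copied from the table; the helper `best_speedup` and the timings in `example_sweep` are hypothetical and only illustrate the selection step, assuming the speedup is the ratio of baseline to speculative time per token.

```python
# Speedups copied from the table (MATH, AIME, GSM8K, GPQA, HumanEval).
TABLE = {
    "Qwen2.5-32B-Instruct": [2.51, 2.45, 2.27, 2.03, 2.68],
    "Qwen2.5-14B-Instruct": [2.33, 2.23, 2.19, 1.98, 2.61],
    "Qwen2.5-7B-Instruct": [2.19, 2.05, 2.02, 1.78, 2.25],
}


def best_speedup(baseline_tpt, tpt_by_length):
    """Return (num_speculative_tokens, speedup) for the fastest setting.

    ASSUMPTION: speedup = baseline time per token / speculative time per token.
    """
    best_k = min(tpt_by_length, key=tpt_by_length.get)
    return best_k, baseline_tpt / tpt_by_length[best_k]


# Recompute the "Average" column as the mean of the five benchmark speedups.
averages = {model: round(sum(row) / len(row), 2) for model, row in TABLE.items()}
print(averages)

# HYPOTHETICAL seconds-per-token measurements for one benchmark, keyed by
# num_speculative_tokens; these are not results from this card.
example_sweep = {3: 0.020, 5: 0.016, 10: 0.018, 20: 0.025}
print(best_speedup(0.040, example_sweep))
```

The recomputed means agree with the Average column when rounded to two decimals.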
