EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Paper: arXiv:2401.15077
This repository contains the EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) checkpoint trained for Qwen3-8B-Instruct. EAGLE is a speculative sampling method that accelerates LLM inference by 2-3x without sacrificing output quality.
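The core idea behind speculative sampling is a draft-and-verify loop: a cheap draft model proposes several tokens, the expensive target model checks them in one pass, and the agreeing prefix is kept. The toy sketch below illustrates only that control flow; the two "models" are made-up functions over digit tokens, not EAGLE or Qwen3, and real verification happens in a single batched forward pass rather than a Python loop.

```python
# Toy draft-and-verify loop illustrating speculative sampling.
# The "models" are stand-in functions over a digit vocabulary;
# only the control flow mirrors the real method.

def draft_model(prefix, k=4):
    """Cheap drafter: proposes k tokens; here it is right for the
    first two positions and then guesses 0 (to force rejections)."""
    return [(prefix[-1] + i + 1) % 10 if i < 2 else 0 for i in range(k)]

def target_model(prefix):
    """Expensive target: the token it would emit next (greedy +1 mod 10)."""
    return (prefix[-1] + 1) % 10

def speculative_step(prefix, k=4):
    """Verify k drafted tokens; keep the agreeing prefix plus one
    corrected (or bonus) token from the target model."""
    draft = draft_model(prefix, k)
    accepted = []
    cur = list(prefix)
    for tok in draft:
        expected = target_model(cur)
        if tok == expected:          # draft agrees with target: accept
            accepted.append(tok)
            cur.append(tok)
        else:                        # first disagreement: take target's token
            accepted.append(expected)
            cur.append(expected)
            break
    else:
        accepted.append(target_model(cur))  # bonus token if all k accepted
    return accepted

seq = [0]
while len(seq) < 12:
    seq.extend(speculative_step(seq))
# Each step keeps 2 drafted tokens + 1 correction, so the target model
# is queried far fewer times than the number of tokens produced.
print(seq)
```

Because every kept token is one the target model itself endorses, the output matches what the target model alone would have generated, only faster.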
```shell
conda create -n eagle python=3.10
conda activate eagle
pip install torch transformers accelerate fschat
```
Clone the EAGLE repository with Qwen3 support:
```shell
git clone https://github.com/your-repo/eagle-qnn.git
cd eagle-qnn
pip install -r requirements.txt
```
```python
from eagle.model.ea_model import EaModel
import torch

model = EaModel.from_pretrained(
    base_model_path="alexchen4ai/Qwen3-8B-Instruct",
    ea_model_path="alexchen4ai/qwen3-8B-eagle",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
model.eval()

prompt = "What is the capital of France?"
input_ids = model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).cuda()

output_ids = model.eagenerate(
    input_ids,
    temperature=0.5,
    max_new_tokens=512
)
output = model.tokenizer.decode(output_ids[0])
print(output)
```
```python
from eagle.modeling_eagle import EAGLE
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("alexchen4ai/Qwen3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "alexchen4ai/Qwen3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
eagle = EAGLE(model, "alexchen4ai/qwen3-8B-eagle")

inputs = tokenizer("What is machine learning?", return_tensors="pt")
outs = eagle.generate(**inputs, max_new_tokens=200, temperature=0.0)
output = tokenizer.decode(outs[0])
print(output)
```
EAGLE typically provides a 2-3x wall-clock speedup over standard autoregressive decoding, with no loss in output quality, since every emitted token is verified by the base model.
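The speedup is easy to sanity-check on your own hardware with a crude wall-clock harness. `baseline_generate` and `eagle_generate` below are hypothetical stand-ins (replace them with calls to `model.generate(...)` and `model.eagenerate(...)` on the same prompt); the `sleep` bodies exist only so the sketch runs anywhere.

```python
# Rough wall-clock comparison between baseline and EAGLE generation.
# The two generate functions are placeholders: swap in real calls to
# model.generate(...) and model.eagenerate(...) when measuring a GPU.
import time

def measure(fn, warmup=1, runs=3):
    """Median wall-clock time of fn() over several runs, after warmup."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def baseline_generate():   # stand-in for model.generate(...)
    time.sleep(0.03)

def eagle_generate():      # stand-in for model.eagenerate(...)
    time.sleep(0.01)

speedup = measure(baseline_generate) / measure(eagle_generate)
print(f"speedup: {speedup:.1f}x")
```

When measuring a real GPU, include the warmup runs (the first call pays compilation and allocation costs) and compare at the same `max_new_tokens` and temperature.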
This checkpoint was trained using the configuration in train/Qwen3_8B_config.json. To train your own EAGLE checkpoint:
```shell
# Generate training data
bash ge_data_qwen3.sh

# Train the auto-regression head
bash train_qwen3.sh
```
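The second step fits EAGLE's draft head to regress the base model's *next* hidden feature from the current feature and token embedding. The toy sketch below shows that objective only: the dimensions, the synthetic linear "teacher", and plain gradient descent are all made up for illustration; the real head is a transformer layer trained on features dumped from the base model by the data-generation step.

```python
# Toy sketch of the EAGLE draft-head objective: regress the next
# hidden feature from [current feature ; current token embedding].
# The linear "teacher" W_true is synthetic, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy feature/embedding dim
W_true = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)  # synthetic teacher

# Synthetic "dumped features": x = [feature_t ; tok_emb_t], y = feature_{t+1}
X = rng.normal(size=(4096, 2 * d))
Y = X @ W_true

W = np.zeros((2 * d, d))                # the draft head's weights
lr = 0.05
for _ in range(500):                    # plain gradient descent on MSE
    grad = X.T @ (X @ W - Y) / len(X)
    W -= lr * grad

err = np.abs(X @ W - Y).max()
print(f"max abs error after training: {err:.6f}")
```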
This implementation adds Qwen3 support to the original EAGLE framework by adding:

- model/modeling_qwen3_kv.py for KV cache management
- ge_data/ge_data_all_qwen3.py for data generation
- train/Qwen3_8B_config.json for training configuration

If you use this model, please cite the EAGLE paper:
```bibtex
@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}
```
License: Apache 2.0