Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (arXiv:2503.01840)
This is an EAGLE-3 draft model trained to accelerate inference of t-tech/T-pro-it-2.1-FP8 via speculative decoding.
Measured speedup: ~2x (after partial training; expected ~3-4x after full training).
EAGLE-3 (arXiv:2503.01840) is a speculative decoding method that trains a small (~1B) draft model to predict several tokens ahead; the large base model then verifies them in a single forward pass. Unlike EAGLE and EAGLE-2, which predict the target model's features, EAGLE-3 predicts tokens directly and fuses features from low, middle, and high layers of the target model, which lets it scale better with training data.
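The draft-then-verify loop can be illustrated with a minimal toy sketch. It is not the EAGLE-3 implementation: `toy_draft` and `toy_target` are hypothetical stand-ins for the draft and base models, acceptance is greedy (as with `temperature=0`), and the target is called once per position for readability, whereas a real system scores all draft positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextTokenFn = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(prefix: List[int], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) The target model checks each proposal in order.
    accepted, ctx = [], list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # All k proposals accepted: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def speculative_generate(prompt: List[int], draft_next: NextTokenFn,
                         target_next: NextTokenFn, k: int,
                         max_new_tokens: int) -> Tuple[List[int], int]:
    """Generate tokens; also return how many verify steps were needed."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


def greedy_generate(prompt: List[int], target_next: NextTokenFn,
                    max_new_tokens: int) -> List[int]:
    """Plain autoregressive baseline: one target call per new token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10;
# the draft agrees except right after token 4, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = speculative_generate([0], toy_draft, toy_target, k=3, max_new_tokens=12)
    print(tokens == greedy_generate([0], toy_target, 12), steps)
```

With greedy acceptance the output matches plain decoding by the target model token for token; the gain comes from needing fewer verification steps than generated tokens, so the speedup depends on how often the draft's guesses are accepted.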
import torch
from transformers import AutoTokenizer
from eagle3.model.ea_model import Eagle3Model

# Load the FP8 base model together with the EAGLE-3 draft model.
model = Eagle3Model.from_pretrained(
    base_model_path="t-tech/T-pro-it-2.1-FP8",
    eagle3_model_path="VirVen/T-pro-it-2.1-eagle3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("t-tech/T-pro-it-2.1-FP8", trust_remote_code=True)

# Build a chat-formatted prompt (the message reads "Hi! Tell me about quantum computers.").
messages = [{"role": "user", "content": "Привет! Расскажи про квантовые компьютеры."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

# Greedy decoding (temperature=0) with speculative drafting and verification.
with torch.no_grad():
    output_ids = model.eagenerate(input_ids, temperature=0, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
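To check the speedup on your own hardware, one option is to compare generated tokens per second with and without the draft model. The sketch below is a generic timing helper, not part of the EAGLE-3 package: `tokens_per_second` and `speedup` are hypothetical names, and the helper assumes the callable returns the full token sequence including the prompt and blocks until generation has finished (on GPU, call `torch.cuda.synchronize()` inside it).

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]], prompt_len: int,
                      warmup: int = 1, runs: int = 3) -> float:
    """Average new tokens per second over several runs of `generate`."""
    # Warm-up runs absorb one-time costs such as kernel compilation and caching.
    for _ in range(warmup):
        generate()

    new_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        # Count only newly generated tokens, not the prompt.
        new_tokens += len(generate()) - prompt_len
    elapsed = time.perf_counter() - start
    return new_tokens / max(elapsed, 1e-9)


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_tps / baseline_tps


# Hypothetical usage with the objects from the snippet above:
# eagle_tps = tokens_per_second(
#     lambda: model.eagenerate(input_ids, temperature=0, max_new_tokens=512)[0].tolist(),
#     prompt_len=input_ids.shape[1],
# )
```

Measure the baseline the same way with a plain autoregressive generation call on the base model, using the same prompt, `max_new_tokens`, and temperature, then pass both numbers to `speedup`.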
Install the EAGLE-3 package:
pip install git+https://github.com/SafeAILab/EAGLE.git
# or from your local repo:
pip install -e .
Base model: t-tech/T-pro-it-2.1-FP8 (Qwen3-32B architecture, FP8)