EAGLE-3 Draft Model for t-tech/T-pro-it-2.1-FP8

This is an EAGLE-3 draft model trained to accelerate inference of t-tech/T-pro-it-2.1-FP8 via speculative decoding.

Measured speedup: ~2x (after partial training; expected ~3-4x after full training).

What is EAGLE-3?

EAGLE-3 (arXiv:2503.01840) is a speculative decoding method that trains a small (~1B parameter) draft model to predict several tokens ahead; the large base model then verifies those tokens in a single forward pass. Unlike EAGLE and EAGLE-2, which predict hidden features, EAGLE-3 predicts tokens directly and fuses features from the low, middle, and high layers of the target model, which lets it scale better with training data.
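The draft-then-verify loop can be illustrated with a toy sketch. This is not the EAGLE-3 implementation: the "models" here are plain Python functions (toy_target, toy_draft, all names hypothetical), drafting is a simple chain rather than a tree, and the target is called once per position instead of once per batch. It only shows why greedy speculative decoding returns exactly the tokens the target model would have produced on its own.

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function: prefix -> token id.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding; returns prompt + generated tokens."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. Draft: the cheap model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify: compare each proposal with the target's own greedy choice.
        #    A real system scores all k positions in one batched forward pass;
        #    this toy version calls the target per position for clarity.
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the accepted prefix plus the correction.
                tokens.extend(proposal[:i] + [expected])
                break
        else:
            # All k drafts accepted; the target contributes one bonus token.
            tokens.extend(proposal + [target(tokens + proposal)])
    return tokens[:limit]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large base model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical draft model: deliberately wrong when len(prefix) % 3 == 0."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50
```

Calling speculative_decode(toy_target, toy_draft, [1, 2, 3], max_new_tokens=20) gives the same sequence as greedy_decode(toy_target, [1, 2, 3], 20). The draft model affects only how many tokens are accepted per verification step, not the output.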

Usage

import torch
from transformers import AutoTokenizer
from eagle3.model.ea_model import Eagle3Model

# Load the base model together with the EAGLE-3 draft model.
model = Eagle3Model.from_pretrained(
    base_model_path="t-tech/T-pro-it-2.1-FP8",
    eagle3_model_path="VirVen/T-pro-it-2.1-eagle3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# The draft model shares the base model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("t-tech/T-pro-it-2.1-FP8", trust_remote_code=True)

# Russian prompt: "Hello! Tell me about quantum computers."
messages = [{"role": "user", "content": "Привет! Расскажи про квантовые компьютеры."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

# Greedy decoding (temperature=0) with speculative decoding enabled.
with torch.no_grad():
    output_ids = model.eagenerate(input_ids, temperature=0, max_new_tokens=512)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
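To check the speedup on your own hardware, one option is to compare generated tokens per second with and without the draft model. The helper below is a minimal sketch, not the procedure used for the figure quoted above; tokens_per_second and its parameters are hypothetical names, and it only assumes a zero-argument callable that returns the full token sequence (prompt plus new tokens).

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]],
                      prompt_len: int, runs: int = 3) -> float:
    """Average number of newly generated tokens per second over `runs` calls.

    `generate` must return the whole sequence (prompt + new tokens). For GPU
    models, the callable should block until generation has finished (for
    example by moving the result to the CPU) so the wall-clock time is valid.
    """
    total_tokens = 0
    total_time = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        output = generate()
        total_time += time.perf_counter() - start
        total_tokens += len(output) - prompt_len
    return total_tokens / total_time
```

With the objects from the usage example, the speculative path could be timed as tokens_per_second(lambda: model.eagenerate(input_ids, temperature=0, max_new_tokens=512)[0].cpu(), prompt_len=input_ids.shape[1]). Timing the same prompt through the base model alone and dividing the two rates gives the speedup; how the baseline is run depends on your setup.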

Requirements

Install the EAGLE-3 package:

pip install git+https://github.com/SafeAILab/EAGLE.git
# or from your local repo:
pip install -e .

Training details

Base model: t-tech/T-pro-it-2.1-FP8 (Qwen3-32B architecture, FP8)
Draft model: 1B params, single transformer layer with 2×hidden_size attention input
Feature fusion: layers 8 (low), 32 (mid), 62 (high) of the target model
Training data: ~50k samples from the saiga dataset
Training: DeepSpeed ZeRO-2 on 2 GPUs, lr=5e-5
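The feature fusion and the 2×hidden_size attention input listed above can be pictured at the level of shapes. The sketch below follows the scheme described in the EAGLE-3 paper (concatenate the three layer features, project back to hidden_size, then concatenate with the token embedding); it is not this model's code. It uses a toy hidden size, random weights, and pure-Python lists, and all function names are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # shape: [out_features][in_features]


def linear(weight: Matrix, x: Vector) -> Vector:
    """Bias-free linear layer: one dot product per output feature."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_features(low: Vector, mid: Vector, high: Vector, fc_weight: Matrix) -> Vector:
    """Concatenate three layer features (3 * hidden) and project to hidden."""
    return linear(fc_weight, low + mid + high)


def draft_attention_input(token_embedding: Vector, fused: Vector) -> Vector:
    """Concatenate token embedding and fused feature: 2 * hidden values."""
    return token_embedding + fused


def demo_shapes(hidden_size: int = 4) -> Tuple[int, int]:
    """Run the pipeline on random toy data and report the resulting sizes."""
    rng = random.Random(0)

    def rand_vec(n: int) -> Vector:
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    # Stand-ins for target-model hidden states at layers 8, 32, and 62.
    low, mid, high = rand_vec(hidden_size), rand_vec(hidden_size), rand_vec(hidden_size)
    fc_weight = [rand_vec(3 * hidden_size) for _ in range(hidden_size)]
    fused = fuse_features(low, mid, high, fc_weight)
    attn_input = draft_attention_input(rand_vec(hidden_size), fused)
    return len(fused), len(attn_input)
```

For a toy hidden size of 4, demo_shapes returns sizes 4 and 8: the fused feature has hidden_size values, and the vector fed to the draft layer's attention has twice that.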

Paper

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (arXiv:2503.01840, published Mar 3, 2025)