EAGLE-Qwen3-8B

This repository contains the EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) checkpoint trained for Qwen3-8B-Instruct. EAGLE is a speculative sampling method that accelerates LLM inference by 2-3x without sacrificing output quality.

Model Details

  • Base model: alexchen4ai/Qwen3-8B-Instruct
  • EAGLE draft checkpoint: alexchen4ai/qwen3-8B-eagle
  • Precision: trained in bf16; the inference examples below load weights in float16
  • License: Apache 2.0

Installation

conda create -n eagle python=3.10
conda activate eagle
pip install torch "transformers>=4.36" accelerate fschat

Clone the EAGLE repository with Qwen3 support:

git clone https://github.com/your-repo/eagle-qnn.git
cd eagle-qnn
pip install -r requirements.txt

Usage

Basic Inference

from eagle.model.ea_model import EaModel
import torch

model = EaModel.from_pretrained(
    base_model_path="alexchen4ai/Qwen3-8B-Instruct",
    ea_model_path="alexchen4ai/qwen3-8B-eagle",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
model.eval()

prompt = "What is the capital of France?"
input_ids = model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).cuda()

output_ids = model.eagenerate(
    input_ids,
    temperature=0.5,
    max_new_tokens=512
)
output = model.tokenizer.decode(output_ids[0])
print(output)
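
To verify the speedup on your own prompts, you can time eagenerate against the base model's standard decoding. The helper below is a generic sketch, not part of the EAGLE API; the commented usage assumes the model object from the example above exposes its base model as model.base_model (as in the upstream EAGLE repository).

```python
import time

def time_generation(generate_fn, n_runs=3):
    """Return mean seconds per run for a zero-argument generation callable."""
    # Warm-up run so one-time costs (CUDA kernel compilation, caching)
    # are not counted in the measurement.
    generate_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    return (time.perf_counter() - start) / n_runs

# Hypothetical usage with the two decoding paths:
# t_eagle = time_generation(lambda: model.eagenerate(input_ids, max_new_tokens=512))
# t_base = time_generation(lambda: model.base_model.generate(input_ids, max_new_tokens=512))
# print(f"speedup: {t_base / t_eagle:.2f}x")
```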

Alternative: Using Generic EAGLE

from eagle.modeling_eagle import EAGLE
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("alexchen4ai/Qwen3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "alexchen4ai/Qwen3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

eagle = EAGLE(model, "alexchen4ai/qwen3-8B-eagle")
inputs = tokenizer("What is machine learning?", return_tensors="pt")
outs = eagle.generate(**inputs, max_new_tokens=200, temperature=0.0)
output = tokenizer.decode(outs[0])
print(output)

Performance

EAGLE typically provides:

  • 2-3x speedup in generation compared to standard autoregressive decoding
  • No quality degradation: speculative sampling is lossless, so outputs follow the same distribution as the base model
  • Lower latency for interactive applications
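
The speedup comes from the draft head proposing several tokens per (expensive) base-model forward pass. A rough back-of-envelope model, assuming an average per-token acceptance rate alpha and a chain draft of length k; the numbers below are illustrative, not measurements for this checkpoint, and EAGLE actually drafts a tree of tokens, which accepts more than a single chain:

```python
def expected_tokens_per_step(alpha, k):
    """Expected accepted tokens per verification step for a chain draft.

    A draft of k tokens yields 1 + alpha + alpha**2 + ... + alpha**k
    tokens in expectation (the +1 is the token sampled after the last
    accepted draft token).
    """
    return sum(alpha ** i for i in range(k + 1))

# With an 80% acceptance rate and 4 drafted tokens, each base-model
# pass yields about 3.4 tokens instead of 1:
print(round(expected_tokens_per_step(0.8, 4), 2))
```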

Training

This checkpoint was trained using the following process:

  1. Data Generation: Training data was generated from the base Qwen3-8B model
  2. Training Configuration: Used the config in train/Qwen3_8B_config.json
  3. Hardware: 8x H200 GPUs with mixed precision (bf16)
  4. Duration: Approximately 4 hours

To train your own EAGLE checkpoint:

# Generate training data
bash ge_data_qwen3.sh

# Train the auto-regression head
bash train_qwen3.sh

Implementation Details

This implementation adds Qwen3 support to the original EAGLE framework by:

  • Adding model/modeling_qwen3_kv.py for KV cache management
  • Creating ge_data/ge_data_all_qwen3.py for data generation
  • Providing train/Qwen3_8B_config.json for training configuration
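
The autoregression head at the core of EAGLE predicts the next-position hidden feature from the current feature and the current token embedding; the predicted feature is then passed through the base model's LM head to draft a token. A toy numpy sketch of that idea follows; the layer width and the single matmul are illustrative only, since the real head is a transformer decoder layer operating at the base model's hidden size:

```python
import numpy as np

hidden = 64  # toy feature width, not the real hidden size
rng = np.random.default_rng(0)

# The head consumes the concatenation [feature ; token embedding]
# and predicts the feature at the next position.
W = rng.standard_normal((2 * hidden, hidden)) * 0.02

def draft_step(feature, token_embedding):
    """One autoregression-head step: next feature from (feature, embedding)."""
    x = np.concatenate([feature, token_embedding])
    return x @ W

feature = rng.standard_normal(hidden)
embedding = rng.standard_normal(hidden)
next_feature = draft_step(feature, embedding)
print(next_feature.shape)  # (64,)
```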

Citation

If you use this model, please cite the EAGLE paper:

@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

License

Apache 2.0

Notes

  • Requires transformers >= 4.36
  • For batch inference (bs>1), set tokenizer.padding_side = "left"
  • This model is optimized for NPU deployment and quantization workflows
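
The left-padding note matters because generation continues from the last position of each row: with right padding, shorter rows would end in pad tokens and the model would be asked to continue from them. A self-contained illustration in pure Python, using hypothetical token ids with 0 as the pad id:

```python
PAD = 0

def pad_batch(seqs, side):
    """Pad variable-length id sequences to equal length on the given side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        padding = [PAD] * (width - len(s))
        out.append(padding + s if side == "left" else s + padding)
    return out

batch = [[11, 12, 13], [21, 22]]
print(pad_batch(batch, "left"))   # [[11, 12, 13], [0, 21, 22]] - every row ends in a real token
print(pad_batch(batch, "right"))  # [[11, 12, 13], [21, 22, 0]] - row 2 ends in PAD
```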