## Model Overview
Qwen2.5-VL-72B-Instruct-eagle3 is a specialized draft model designed to accelerate inference for Qwen2.5-VL-72B-Instruct using the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) framework.
Built upon the Llama architecture, this model acts as a highly efficient drafter. It has been trained on the ALLaVA-4V dataset, ensuring strict alignment with the teacher model's distribution.
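As background, the draft-and-verify idea behind speculative decoding can be illustrated with a toy greedy-matching sketch. This is purely illustrative (the function and token sequences below are hypothetical): real EAGLE3 drafts at the feature level and verifies tree-structured candidates, not a flat list.

```python
def count_emitted(target_tokens, draft_tokens, pos):
    """Toy speculative-decoding step: accept draft tokens while they match
    the target model's choice (modeled here as a fixed reference sequence),
    then emit one token from the target model itself."""
    accepted = 0
    for tok in draft_tokens:
        if pos + accepted < len(target_tokens) and tok == target_tokens[pos + accepted]:
            accepted += 1
        else:
            break
    # accepted draft tokens plus the one token the verify pass produces anyway
    return accepted + 1

# With a well-aligned drafter, one verify pass emits several tokens:
print(count_emitted([5, 7, 9, 11], [5, 7, 2], pos=0))  # → 3
```

The average number of tokens emitted per verify pass is what the "Accept Length" column in the benchmark table measures.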
The metrics below demonstrate robust acceleration across diverse and complex domains on the MMStar benchmark.

### MMStar Benchmark Performance Comparison (v0.5.6.post2)
| Model Configuration | TP | Parallel | Throughput (token/s) | Accept Length |
|---|---|---|---|---|
| Qwen2.5-VL-72B-Instruct | 2 | 1 | 42.76 | 1.000 |
| Qwen2.5-VL-72B-Instruct | 2 | 8 | 272.38 | 1.000 |
| Qwen2.5-VL-72B-Instruct + Eagle3 (3 2 4) | 2 | 1 | 95.39 | 2.750 |
| Qwen2.5-VL-72B-Instruct + Eagle3 (3 2 4) | 2 | 8 | 478.30 | 2.757 |

*"(3 2 4)" denotes the speculative-decoding configuration: 3 draft steps, top-k of 2, and 4 draft tokens, matching the launch flags in the Quick Start below.*
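The throughput figures above imply the following end-to-end speedups, computed directly from the table:

```python
# Speedup of the Eagle3 configuration over the baseline, per parallelism level,
# using the throughput (token/s) numbers from the table above.
baseline = {"parallel=1": 42.76, "parallel=8": 272.38}
eagle3 = {"parallel=1": 95.39, "parallel=8": 478.30}

for key in baseline:
    speedup = eagle3[key] / baseline[key]
    print(f"{key}: {speedup:.2f}x")  # parallel=1: 2.23x, parallel=8: 1.76x
```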
## Quick Start

### Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
### Installation

```shell
pip install sglang==0.5.6.post2
```

### Inference with SGLang
```shell
python3 -m sglang.launch_server \
    --model-path Qwen2.5-VL-72B-Instruct \
    --speculative-draft-model-path AQ-MedAI/Qwen2.5-VL-72B-Instruct-eagle3 \
    --trust-remote-code \
    --speculative-algo EAGLE3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 4 \
    --tp 2 \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 \
    --port 30012
```
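Once the server is up, it can be queried through SGLang's OpenAI-compatible chat endpoint. A minimal request sketch using only the standard library (the image URL is a placeholder; swap in a real one):

```python
import json
import urllib.request

# OpenAI-style chat request for the SGLang server launched above (port 30012).
# The image URL below is a placeholder, not a real asset.
payload = {
    "model": "Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:30012/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to the client: responses are identical in format to the non-Eagle3 server, only faster.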
## Training Data
The model was trained on 400,000 samples sourced from the ALLaVA-4V dataset.
## Citation
If you use this model in your research or application, please cite the following:
```bibtex
@misc{qwen2.5vleagle3,
  title={Qwen2.5-VL-72B-Instruct-eagle3: Accelerating Instruction Following with EAGLE},
  author={Ant AQ Team},
  year={2026},
}
```