## Model Overview
Qwen2.5-VL-72B-Instruct-eagle3 is a specialized draft model designed to accelerate inference for Qwen2.5-VL-72B-Instruct using the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) framework.
Built upon the Llama architecture, this model acts as a highly efficient drafter. It has been trained on the ALLaVA-4V dataset, ensuring strict alignment with the teacher model's distribution.
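As background, the draft-and-verify idea behind speculative decoding can be illustrated with a toy greedy-matching sketch. This is purely illustrative (the function and token sequences below are hypothetical): real EAGLE3 drafts at the feature level and verifies tree-structured candidates, not a flat list.

```python
def count_emitted(target_tokens, draft_tokens, pos):
    """Toy speculative-decoding step: accept draft tokens while they match
    the target model's choice (modeled here as a fixed reference sequence),
    then emit one token from the target model itself."""
    accepted = 0
    for tok in draft_tokens:
        if pos + accepted < len(target_tokens) and tok == target_tokens[pos + accepted]:
            accepted += 1
        else:
            break
    # accepted draft tokens plus the one token the verify pass produces anyway
    return accepted + 1

# With a well-aligned drafter, one verify pass emits several tokens:
print(count_emitted([5, 7, 9, 11], [5, 7, 2], pos=0))  # → 3
```

The average number of tokens emitted per verify pass is what the "Accept Length" column in the benchmark table measures.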
The metrics below demonstrate robust acceleration across diverse and complex domains on the MMStar benchmark.

### MMStar Benchmark Performance Comparison (v0.5.6.post2)
| Model Configuration | TP | Parallel | Throughput (token/s) | Accept Length |
|---|---|---|---|---|
| Qwen2.5-VL-72B-Instruct | 2 | 1 | 42.76 | 1.000 |
| Qwen2.5-VL-72B-Instruct | 2 | 8 | 272.38 | 1.000 |
| Qwen2.5-VL-72B-Instruct + Eagle3 (3 2 4) | 2 | 1 | 95.39 | 2.750 |
| Qwen2.5-VL-72B-Instruct + Eagle3 (3 2 4) | 2 | 8 | 478.30 | 2.757 |

*"(3 2 4)" denotes the speculative-decoding configuration: 3 draft steps, top-k of 2, and 4 draft tokens, matching the launch flags in the Quick Start below.*
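The throughput figures above imply the following end-to-end speedups, computed directly from the table:

```python
# Speedup of the Eagle3 configuration over the baseline, per parallelism level,
# using the throughput (token/s) numbers from the table above.
baseline = {"parallel=1": 42.76, "parallel=8": 272.38}
eagle3 = {"parallel=1": 95.39, "parallel=8": 478.30}

for key in baseline:
    speedup = eagle3[key] / baseline[key]
    print(f"{key}: {speedup:.2f}x")  # parallel=1: 2.23x, parallel=8: 1.76x
```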
## Quick Start

### Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
### Installation

```shell
pip install sglang==0.5.6.post2
```

### Inference with SGLang
```shell
python3 -m sglang.launch_server \
    --model-path Qwen2.5-VL-72B-Instruct \
    --speculative-draft-model-path AQ-MedAI/Qwen2.5-VL-72B-Instruct-eagle3 \
    --trust-remote-code \
    --speculative-algo EAGLE3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 4 \
    --tp 2 \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 \
    --port 30012
```
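Once the server is up, it can be queried through SGLang's OpenAI-compatible chat endpoint. A minimal request sketch using only the standard library (the image URL is a placeholder; swap in a real one):

```python
import json
import urllib.request

# OpenAI-style chat request for the SGLang server launched above (port 30012).
# The image URL below is a placeholder, not a real asset.
payload = {
    "model": "Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:30012/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to the client: responses are identical in format to the non-Eagle3 server, only faster.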
## Training Data
The model was trained on 400,000 samples sourced from the ALLaVA-4V dataset.
## Citation
If you use this model in your research or application, please cite the following:
```bibtex
@misc{qwen2.5vleagle3,
  title={Qwen2.5-VL-72B-Instruct-eagle3: Accelerating Instruction Following with EAGLE},
  author={Ant AQ Team},
  year={2026},
}
```