## Model Overview
Qwen3-VL-8B-Instruct-eagle3 is a specialized draft model that accelerates inference for Qwen3-VL-8B-Instruct using the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) framework.
Built on the Llama architecture, the model acts as a lightweight, highly efficient drafter. It was trained on the ALLaVA-4V dataset to keep its output distribution tightly aligned with the teacher model's.
The metrics below demonstrate robust acceleration across the diverse and complex domains covered by the MMStar benchmark.
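The drafter speeds up decoding via draft-then-verify speculative decoding: the cheap draft model proposes several tokens, and the target model verifies them in a single pass, keeping the longest agreeing prefix plus one corrected token. The toy sketch below illustrates only that control flow; the real EAGLE3 drafter predicts from the target model's hidden features, and `draft_next`/`target_next` here are made-up stand-ins, not SGLang APIs.

```python
# Toy sketch of a draft-then-verify step (greedy case). Illustrative only:
# EAGLE3 actually drafts from the target model's hidden features.

def speculative_step(draft_next, target_next, prefix, num_steps=3):
    """Draft num_steps tokens cheaply, then keep the longest prefix the
    target model agrees with, plus one corrected token from the target."""
    drafts, ctx = [], list(prefix)
    for _ in range(num_steps):
        t = draft_next(ctx)          # cheap draft proposal
        drafts.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for d in drafts:
        t = target_next(ctx)         # target model's greedy choice
        accepted.append(t)
        if t != d:                   # mismatch: take target token, stop
            break
        ctx.append(t)                # match: accepted "for free"
    return accepted

# Toy "models": the target emits a fixed sequence; the draft agrees on
# the first two tokens, then diverges.
target_seq = [1, 2, 3, 4]
target_next = lambda ctx: target_seq[len(ctx)]
draft_next = lambda ctx: [1, 2, 9, 9][len(ctx)]

print(speculative_step(draft_next, target_next, []))  # → [1, 2, 3]
```

One call produced three tokens for a single (conceptual) target pass; the "Accept Length" column in the table below measures exactly this average yield per verification step.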
### MMStar Benchmark Performance Comparison (v0.5.6.post2)
| Model Configuration | TP | Concurrency | Throughput (tokens/s) | Accept Length |
|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 1 | 1 | 171.215 | 1.000 |
| Qwen3-VL-8B-Instruct | 1 | 8 | 955.737 | 1.000 |
| Qwen3-VL-8B-Instruct + EAGLE3 (steps=3, topk=2, draft=4) | 1 | 1 | 255.190 | 2.493 |
| Qwen3-VL-8B-Instruct + EAGLE3 (steps=3, topk=2, draft=4) | 1 | 8 | 1411.867 | 2.485 |
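The relative speedup implied by the table can be read off directly from the throughput columns:

```python
# Speedup of the EAGLE3-accelerated configuration over the baseline,
# using the throughput numbers (tokens/s) from the MMStar table above.
baseline = {1: 171.215, 8: 955.737}    # concurrency -> baseline tokens/s
eagle3 = {1: 255.190, 8: 1411.867}     # concurrency -> EAGLE3 tokens/s

for conc in (1, 8):
    speedup = eagle3[conc] / baseline[conc]
    print(f"concurrency {conc}: {speedup:.2f}x")  # ~1.49x and ~1.48x
```

The speedup (~1.5x) is lower than the accept length (~2.5) because each verified draft still costs a target-model forward pass plus drafter overhead.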
## Quick Start
### Requirements
- NVIDIA GPU
- CUDA 12.0+
- PyTorch 2.0+
### Installation

```shell
pip install sglang==0.5.6.post2
```
### Inference with SGLang
```shell
python3 -m sglang.launch_server \
  --model-path Qwen3-VL-8B-Instruct \
  --speculative-draft-model-path AQ-MedAI/Qwen3-VL-8B-Instruct-eagle3 \
  --trust-remote-code \
  --speculative-algo EAGLE3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 4 \
  --tp 1 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 \
  --port 30012
```
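Once the server is up, it can be queried through SGLang's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal stdlib-only client sketch follows; the base URL matches the launch command above, while the image URL and question are placeholders you should replace.

```python
# Minimal client sketch for the server launched above, via SGLang's
# OpenAI-compatible /v1/chat/completions endpoint. The image URL below
# is a placeholder; substitute your own.
import json
import urllib.request

def build_request(image_url, question, base_url="http://localhost:30012"):
    """Build a chat-completions request carrying one image and one question."""
    payload = {
        "model": "Qwen3-VL-8B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Usage (requires the server to be running):
# req = build_request("https://example.com/cat.png", "What is in this image?")
# print(urllib.request.urlopen(req).read().decode())
```

Speculative decoding is transparent to the client: requests are identical to those sent to a vanilla server, and only throughput changes.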
## Training Data
The model was trained on 400,000 samples sourced from the ALLaVA-4V dataset.
## Citation
If you use this model in your research or application, please cite the following:
```bibtex
@misc{qwen3vleagle3,
  title={Qwen3-VL-8B-Instruct-eagle3: Accelerating Instruction Following with EAGLE},
  author={Ant AQ Team},
  year={2026},
}
```