Model Overview

Qwen3-VL-8B-Instruct-eagle3 is a specialized draft model that accelerates inference for Qwen3-VL-8B-Instruct using the EAGLE3 (Extrapolation Algorithm for Greater Language-model Efficiency) framework.

Built on the Llama architecture, this model acts as a highly efficient drafter. It was trained on the ALLaVA-4V dataset to align closely with the teacher model's output distribution.

The metrics below demonstrate robust acceleration across diverse and complex domains on the MMStar benchmark.

MMStar Benchmark Performance Comparison (v0.5.6.post2)

| Model Configuration                    | TP | Parallel | Throughput (token/s) | Accept Length |
|----------------------------------------|----|----------|----------------------|---------------|
| Qwen3-VL-8B-Instruct                   | 1  | 1        | 171.215              | 1.000         |
| Qwen3-VL-8B-Instruct                   | 1  | 8        | 955.737              | 1.000         |
| Qwen3-VL-8B-Instruct + Eagle3 (3 2 4)  | 1  | 1        | 255.190              | 2.493         |
| Qwen3-VL-8B-Instruct + Eagle3 (3 2 4)  | 1  | 8        | 1411.867             | 2.485         |

Here "(3 2 4)" denotes the speculative settings used: 3 draft steps, top-k of 2, and 4 draft tokens, matching the launch command below.
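The end-to-end speedup implied by the table is just the ratio of EAGLE3 throughput to baseline throughput at each parallelism level. A minimal sketch (the throughput numbers are copied from the table above):

```python
# Throughput (token/s) from the MMStar benchmark table, keyed by parallelism.
baseline = {1: 171.215, 8: 955.737}     # Qwen3-VL-8B-Instruct alone
eagle3 = {1: 255.190, 8: 1411.867}      # with the EAGLE3 draft model

# Speedup = EAGLE3 throughput / baseline throughput at the same parallelism.
for parallel in (1, 8):
    speedup = eagle3[parallel] / baseline[parallel]
    print(f"parallel={parallel}: {speedup:.2f}x speedup")
```

Both settings land at roughly 1.48-1.49x, i.e. the accept length of ~2.5 does not translate into a 2.5x speedup because each accepted token still costs draft-model forward passes and verification overhead.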

Quick Start

Requirements

  • NVIDIA GPU
  • CUDA 12.0+
  • PyTorch 2.0+

Installation

pip install sglang==0.5.6.post2

Inference with SGLang

python3 -m sglang.launch_server \
  --model-path Qwen3-VL-8B-Instruct \
  --speculative-draft-model-path AQ-MedAI/Qwen3-VL-8B-Instruct-eagle3 \
  --trust-remote-code \
  --speculative-algo EAGLE3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 4 \
  --tp 1 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 \
  --port 30012
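Once the server is up, it can be queried through the OpenAI-compatible `/v1/chat/completions` endpoint that SGLang exposes. A minimal client sketch, assuming the server from the command above is reachable at `http://localhost:30012` (the image URL and question here are placeholders):

```python
import json
from urllib import request as urlreq

def build_chat_request(image_url: str, question: str) -> dict:
    """Build an OpenAI-style multimodal chat payload: the image is passed
    as an image_url content part alongside the text question."""
    return {
        "model": "Qwen3-VL-8B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }

def send(payload: dict, base_url: str = "http://localhost:30012") -> dict:
    """POST the payload to the running SGLang server and return the JSON reply."""
    req = urlreq.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlreq.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("https://example.com/cat.png", "Describe the image.")
# send(payload)  # uncomment with a live server
```

Speculative decoding is transparent to the client: requests and responses are identical to the non-speculative server, only latency changes.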

Training Data

The model was trained on 400,000 samples sourced from the ALLaVA-4V dataset.

Citation

If you use this model in your research or application, please cite the following:

@misc{qwen3vleagle3,
  title={Qwen3-VL-8B-Instruct-eagle3: Accelerating Instruction Following with EAGLE},
  author={Ant AQ Team},
  year={2026},
}