Qwen3.5-4B-DFlash

Paper | GitHub | Blog

DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel, achieving up to 3.7x speedup over autoregressive decoding. This repository contains the drafter model; it must be paired with the target model Qwen/Qwen3.5-4B.

Figure: DFlash architecture.

Quick Start

Installation

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"

Launch Server

Use --speculative-num-draft-tokens to set the block size (8 or 16).

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-4B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-4B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code

Tip: For long-context or agentic workloads, add --speculative-dflash-draft-window-size WINDOW_SIZE to enable sliding-window attention for the drafter.
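As a sketch, the launch command above with the drafter window enabled; the window size of 4096 is purely illustrative (not a recommended setting), so tune it for your workload:

```shell
# Same launch command as above, plus a sliding-window drafter.
# The 4096 value is an illustrative placeholder, not a tuned default.
python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-4B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-4B-DFlash \
    --speculative-num-draft-tokens 16 \
    --speculative-dflash-draft-window-size 4096 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
```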

Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)

vLLM

Community-contributed vLLM support is available; see PRs #36847 and #36767 for details.

Benchmark Results

Setup: Single NVIDIA B200, SGLang, thinking enabled, max output length 4096. We report end-to-end throughput, including prefill time. See our GitHub repository for reproduction scripts.

Throughput and Speedup

DFlash outperforms MTP across all block sizes and concurrency levels, achieving up to 3.7x speedup at concurrency 1 (HumanEval, block size 16).

Tokens/sec (speedup vs. autoregressive baseline)

Block Size = 16

| Task | Concurrency | AR | MTP | DFlash |
|---|---|---|---|---|
| Math500 | 1 | 274 | 458 (1.7x) | 959 (3.5x) |
| Math500 | 8 | 1971 | 3032 (1.5x) | 5851 (3.0x) |
| Math500 | 16 | 3663 | 4827 (1.3x) | 8563 (2.3x) |
| Math500 | 32 | 5836 | 6873 (1.2x) | 10713 (1.8x) |
| GSM8K | 1 | 271 | 432 (1.6x) | 840 (3.1x) |
| GSM8K | 8 | 1939 | 2778 (1.4x) | 4945 (2.6x) |
| GSM8K | 16 | 3599 | 4388 (1.2x) | 7103 (2.0x) |
| GSM8K | 32 | 5655 | 6209 (1.1x) | 8806 (1.6x) |
| HumanEval | 1 | 270 | 472 (1.7x) | 1004 (3.7x) |
| HumanEval | 8 | 1892 | 2940 (1.6x) | 5495 (2.9x) |
| HumanEval | 16 | 3393 | 4662 (1.4x) | 7847 (2.3x) |
| HumanEval | 32 | 5208 | 6448 (1.2x) | 9333 (1.8x) |
| MBPP | 1 | 273 | 404 (1.5x) | 895 (3.3x) |
| MBPP | 8 | 1880 | 2504 (1.3x) | 4884 (2.6x) |
| MBPP | 16 | 3295 | 3856 (1.2x) | 6503 (2.0x) |
| MBPP | 32 | 5103 | 5608 (1.1x) | 8216 (1.6x) |
| MT-Bench | 1 | 271 | 394 (1.5x) | 774 (2.9x) |
| MT-Bench | 8 | 1958 | 2501 (1.3x) | 4512 (2.3x) |
| MT-Bench | 16 | 3635 | 3834 (1.1x) | 6363 (1.8x) |
| MT-Bench | 32 | 5762 | 5468 (0.9x) | 7834 (1.4x) |
| Alpaca | 1 | 279 | 350 (1.3x) | 680 (2.4x) |
| Alpaca | 8 | 1987 | 2363 (1.2x) | 4305 (2.2x) |
| Alpaca | 16 | 3639 | 3771 (1.0x) | 6161 (1.7x) |
| Alpaca | 32 | 5720 | 5331 (0.9x) | 7683 (1.3x) |

Block Size = 8

| Task | Concurrency | AR | MTP | DFlash |
|---|---|---|---|---|
| Math500 | 1 | 271 | 576 (2.1x) | 803 (3.0x) |
| Math500 | 8 | 1947 | 3880 (2.0x) | 5545 (2.8x) |
| Math500 | 16 | 3672 | 6291 (1.7x) | 8804 (2.4x) |
| Math500 | 32 | 5849 | 9085 (1.6x) | 12339 (2.1x) |
| GSM8K | 1 | 275 | 537 (2.0x) | 732 (2.7x) |
| GSM8K | 8 | 1965 | 3592 (1.8x) | 4902 (2.5x) |
| GSM8K | 16 | 3620 | 5767 (1.6x) | 7728 (2.1x) |
| GSM8K | 32 | 5712 | 8254 (1.4x) | 10710 (1.9x) |
| HumanEval | 1 | 269 | 549 (2.0x) | 790 (2.9x) |
| HumanEval | 8 | 1888 | 3532 (1.9x) | 5045 (2.7x) |
| HumanEval | 16 | 3398 | 5621 (1.7x) | 7672 (2.3x) |
| HumanEval | 32 | 5156 | 7787 (1.5x) | 10207 (2.0x) |
| MBPP | 1 | 269 | 523 (1.9x) | 764 (2.8x) |
| MBPP | 8 | 1842 | 3371 (1.8x) | 4562 (2.5x) |
| MBPP | 16 | 3254 | 5165 (1.6x) | 6639 (2.0x) |
| MBPP | 32 | 5047 | 7068 (1.4x) | 8582 (1.7x) |
| MT-Bench | 1 | 279 | 503 (1.8x) | 688 (2.5x) |
| MT-Bench | 8 | 1996 | 3274 (1.6x) | 4488 (2.2x) |
| MT-Bench | 16 | 3637 | 5160 (1.4x) | 6868 (1.9x) |
| MT-Bench | 32 | 5741 | 7643 (1.3x) | 9938 (1.7x) |
| Alpaca | 1 | 272 | 467 (1.7x) | 615 (2.3x) |
| Alpaca | 8 | 1941 | 3156 (1.6x) | 4112 (2.1x) |
| Alpaca | 16 | 3590 | 5064 (1.4x) | 6446 (1.8x) |
| Alpaca | 32 | 5617 | 7279 (1.3x) | 9003 (1.6x) |
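The speedup figures in parentheses are simply the ratio of each method's throughput to the AR baseline on the same row; a minimal check against the Block Size = 16 numbers above:

```python
# Speedup = method throughput / autoregressive throughput
# (same task, same concurrency), rounded to one decimal as in the tables.
def speedup(method_tps: float, ar_tps: float) -> float:
    return round(method_tps / ar_tps, 1)

# HumanEval, concurrency 1, block size 16 (values from the table above).
print(speedup(1004, 270))  # DFlash -> 3.7
print(speedup(472, 270))   # MTP    -> 1.7
```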

Acceptance Length

Mean accepted tokens per target forward pass, reported as MTP / DFlash:

| Task | Block Size = 8 (MTP / DFlash) | Block Size = 16 (MTP / DFlash) |
|---|---|---|
| Math500 | 5.34 / 5.48 | 6.47 / 7.11 |
| GSM8K | 5.14 / 5.12 | 6.15 / 6.40 |
| HumanEval | 5.15 / 5.49 | 6.27 / 7.29 |
| MBPP | 4.63 / 5.05 | 5.27 / 6.34 |
| MT-Bench | 4.59 / 4.55 | 5.31 / 5.60 |
| Alpaca | 4.39 / 4.29 | 5.08 / 5.23 |
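Since acceptance length is the average number of tokens committed per target-model forward pass, it gives a rough upper bound on achievable speedup (ignoring drafter and verification overhead); the gap to the measured end-to-end speedup shows how much of that bound is realized. A back-of-the-envelope sketch using values from the tables above:

```python
# Acceptance length caps speedup: each target forward pass commits at most
# `accept_len` tokens, versus 1 token for plain autoregressive decoding.
accept_len = 7.29        # DFlash, HumanEval, block size 16
measured_speedup = 3.7   # end-to-end speedup at concurrency 1

# Fraction of the ideal bound realized end to end; the remainder is
# drafter compute, verification, and serving overhead.
efficiency = measured_speedup / accept_len
print(f"upper bound: {accept_len:.2f}x, realized: {efficiency:.0%}")
```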

Acknowledgements

Special thanks to David Wang for his outstanding engineering support on this project. We are also grateful to Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: DFlash Feedback.

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}