
Qwen3.5-27B-DFlash

Paper | GitHub | Blog

This model is still under training.

DFlash is a novel speculative decoding method that utilizes a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This model is the drafter component. It must be used in conjunction with the target model Qwen/Qwen3.5-27B. It was trained with a context length of 4096 tokens.

[Figure: DFlash architecture]

🚀 Quick Start

SGLang

Installation

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"

Inference

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-27B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code

Note: For long-context or agentic usage, consider adding --speculative-dflash-draft-window-size WINDOW_SIZE to enable sliding-window attention for the draft model.
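Once the server is up, speculative decoding is transparent to clients: any request through SGLang's OpenAI-compatible API exercises the DFlash drafter. The sketch below assumes the server's default address (`http://localhost:30000`); adjust the URL if you launched with a different host or port.

```python
# Hypothetical client call against the SGLang server launched above.
# Speculative decoding needs no client-side changes; this is a plain
# OpenAI-compatible chat completion request using only the stdlib.
import json
from urllib import request

payload = {
    "model": "Qwen/Qwen3.5-27B",
    "messages": [
        {"role": "user",
         "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}

def chat(url="http://localhost:30000/v1/chat/completions"):
    # POST the chat request and return the generated message text.
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```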

vLLM

Thanks to the community and all contributors! Check out the following PRs to see how to run DFlash on vLLM: #36847 and #36767.

Early Results

  • Thinking: enabled
  • Max new tokens: 4096
  • Block size: 16
  • Checkpoint: 0.5 epoch

| Dataset   | Accept Length |
|-----------|---------------|
| GSM8K     | 5.92          |
| Math500   | 6.49          |
| HumanEval | 7.26          |
| MBPP      | 5.75          |
| MT-Bench  | 4.47          |
| Alpaca    | 4.15          |
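As a rough interpretation of these numbers: if one target forward pass verifies each drafted block, the accept length bounds how many tokens each target pass yields, so an accept length of 5.92 on GSM8K means up to ~5.9× fewer sequential target passes. The sketch below turns this into a back-of-envelope estimate; the 10% draft-overhead figure is an illustrative assumption, not a measured number.

```python
# Back-of-envelope speedup estimate from the accept lengths above.
# Assumes one target pass per drafted block, with the drafter modeled
# as a fixed fractional overhead per block (0.1 is assumed, not measured).
accept_lengths = {
    "GSM8K": 5.92, "Math500": 6.49, "HumanEval": 7.26,
    "MBPP": 5.75, "MT-Bench": 4.47, "Alpaca": 4.15,
}

def est_speedup(accept_len, draft_overhead=0.1):
    # Baseline: 1 target pass per token.
    # Speculative: (1 + draft_overhead) passes per accept_len tokens.
    return accept_len / (1.0 + draft_overhead)

for name, al in accept_lengths.items():
    print(f"{name}: ~{est_speedup(al):.1f}x fewer target passes per token")
```

Real wall-clock speedup also depends on batch size, attention backend, and drafting cost, so treat this only as an upper-bound intuition.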
Model details: 4B parameters, BF16, Safetensors format.