# Qwen3.5-27B-DFlash

> **Note:** This model is still under training.
DFlash is a novel speculative decoding method that utilizes a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
This model is the drafter component. It must be used in conjunction with the target model Qwen/Qwen3.5-27B. It was trained with a context length of 4096 tokens.
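DFlash follows the standard speculative decoding loop: the drafter proposes a block of tokens, and the target model verifies the whole block in one forward pass, accepting the longest prefix that matches its own greedy choices plus one corrected token. The sketch below illustrates that accept/verify loop with toy stand-in models; `target_next` and `draft_block` are illustrative placeholders, not the actual DFlash drafter or Qwen target.

```python
def target_next(prefix):
    # Toy greedy target model: the "true" next token is a deterministic
    # function of the prefix (stands in for an argmax over target logits).
    return (sum(prefix) * 31 + len(prefix)) % 10

def draft_block(prefix, k):
    # Toy drafter: imitates the target, but injects one deliberate mistake
    # at the end of the block so the verify step has something to reject.
    ctx = list(prefix)
    block = []
    for i in range(k):
        tok = target_next(ctx)
        if i == k - 1:
            tok = (tok + 1) % 10  # deliberate wrong token
        block.append(tok)
        ctx.append(tok)
    return block

def verify_and_accept(prefix, block):
    # One target pass scores every draft position in parallel; accept the
    # longest matching prefix, then append the target's corrected token.
    ctx = list(prefix)
    accepted = 0
    for tok in block:
        true_tok = target_next(ctx)
        if tok != true_tok:
            ctx.append(true_tok)  # bonus token from the target model
            return ctx, accepted
        ctx.append(tok)
        accepted += 1
    return ctx, accepted

prefix = [1, 2, 3]
block = draft_block(prefix, 16)
prefix, n_accepted = verify_and_accept(prefix, block)
print(f"accepted {n_accepted} of 16 draft tokens")  # prints: accepted 15 of 16 draft tokens
```

With a good drafter, most of each block is accepted, so one target forward pass yields several tokens instead of one.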
## Quick Start

### SGLang

#### Installation

```shell
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```
#### Inference

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code
```
Note: For long-context or agentic usage, consider adding `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the draft model.
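Once launched, the server exposes an OpenAI-compatible API. Below is a minimal client sketch using only the standard library, assuming SGLang's default port 30000 (adjust if you passed `--port`); the helper name `build_chat_request` is our own, not part of SGLang.

```python
import json
import urllib.request

SERVER = "http://localhost:30000"  # SGLang's default port; adjust if you passed --port

def build_chat_request(prompt, max_tokens=128):
    # Build an OpenAI-compatible chat completion request for the server above.
    payload = {
        "model": "Qwen/Qwen3.5-27B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Send it (requires the server above to be running):
# with urllib.request.urlopen(build_chat_request("What is speculative decoding?")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to clients: requests and responses look identical to plain decoding, only faster.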
### vLLM

Thanks to the community and all contributors! See the following PRs for how to run DFlash on vLLM: #36847 and #36767.
## Early Results
- Thinking: enabled
- Max new tokens: 4096
- Block size: 16
- Checkpoint: 0.5 epoch
| Dataset   | Accept Length |
|-----------|---------------|
| GSM8K     | 5.92          |
| Math500   | 6.49          |
| HumanEval | 7.26          |
| MBPP      | 5.75          |
| MT-Bench  | 4.47          |
| Alpaca    | 4.15          |
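Accept length does not translate one-to-one into end-to-end speedup, since each round also pays for drafting. A rough back-of-envelope conversion under a hypothetical cost model (drafting one block costs a fixed fraction of a target forward pass; this is our illustration, not a measured result):

```python
def estimated_speedup(accept_length, draft_cost_ratio):
    # Simplified cost model: each draft-and-verify round costs
    # (1 + draft_cost_ratio) target-forward equivalents and yields
    # `accept_length` tokens on average, versus 1 token per forward
    # pass for plain autoregressive decoding.
    return accept_length / (1.0 + draft_cost_ratio)

# Assuming (hypothetically) drafting costs 10% of a target forward pass:
for dataset, tau in [("GSM8K", 5.92), ("HumanEval", 7.26), ("Alpaca", 4.15)]:
    print(f"{dataset}: ~{estimated_speedup(tau, 0.1):.2f}x")
```

Real speedups depend on batch size, hardware, and scheduler overheads, so treat this as intuition only.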