---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- dflash
- speculative-decoding
- block-diffusion
- draft-model
- efficiency
- qwen
- diffusion-language-model
---
# Qwen3-Coder-Next-DFlash
[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next).
<div align="center">
<img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
</div>
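
To make the draft-then-verify idea concrete, here is a toy sketch of one greedy verification step in generic speculative decoding. It is not DFlash's implementation: the function name `accept_draft` and the integer token IDs are hypothetical, and the target model's choices are passed in as a plain list instead of being computed by a model.

```python
from typing import List, Sequence


def accept_draft(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens committed by one greedy draft-and-verify step.

    `draft` holds the k tokens proposed by the drafter. `target` holds the
    k + 1 tokens the target model would pick greedily at the same positions
    (one extra for the position after the last draft token).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")

    committed: List[int] = []
    for proposed, expected in zip(draft, target):
        if proposed != expected:
            # First mismatch: keep the target's token and discard the rest.
            committed.append(expected)
            return committed
        committed.append(proposed)

    # Every draft token matched, so the target's extra token is kept too.
    committed.append(target[-1])
    return committed


if __name__ == "__main__":
    print(accept_draft([5, 9, 2, 7], [5, 9, 4, 1, 3]))  # [5, 9, 4]
    print(accept_draft([5, 9, 2, 7], [5, 9, 2, 7, 8]))  # [5, 9, 2, 7, 8]
```

The more draft tokens the target agrees with, the more tokens a single target forward pass commits, which is where the speedup comes from.
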
## Quick Start
### Installation
```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
```
### Launch Server
Use `--speculative-num-draft-tokens` to set the block size to 8 or **16**. The command below and the acceptance lengths reported in this README use 16.
```bash
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_DFLASH_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
```
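
Loading the target and draft models can take a while, so it may help to wait until the server answers before sending requests. The sketch below polls a URL using only the standard library. The helper name `wait_for_server` is hypothetical, and the URL in the usage note assumes the server listens on port 30000 (as in the client example in this README) and exposes the OpenAI-compatible `/v1/models` route.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet; retry after a pause.
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_for_server("http://localhost:30000/v1/models")` returns `True` once the route responds and `False` if the timeout is reached first.
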
> **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
### Usage
```python
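from openai import OpenAI

# Point the client at the local server; "EMPTY" is a placeholder API key.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Requests name the target model; the drafter is configured at server launch.
response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)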
```
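
To check whether speculative decoding helps on your own prompts, one option is to time a request and divide the generated token count by the elapsed time. This is a rough sketch, not a benchmark harness: `tokens_per_second` and `timed_completion` are hypothetical helpers, `client` is assumed to be an OpenAI-style client like the one above, and the server is assumed to fill in `response.usage.completion_tokens`. The measured time includes prefill, so it is an end-to-end figure.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput for one request, in generated tokens per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def timed_completion(client, prompt: str, model: str = "Qwen/Qwen3-Coder-Next") -> float:
    """Send one chat request through an OpenAI-style client and return tokens/s."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096,
    )
    elapsed = time.perf_counter() - start
    # Assumes the server reports token usage for non-streaming responses.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Running `timed_completion(client, "Write a quicksort in Python.")` against a server launched with and without the speculative flags gives a simple before/after comparison.
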
### vLLM
Community-contributed support is available. See PRs [#36847](https://github.com/vllm-project/vllm/pull/36847) and [#36767](https://github.com/vllm-project/vllm/pull/36767) for details.
## Acceptance Length
Evaluation settings:

- Max new tokens: 4096
- Block size: 16

| Dataset       | Accept Length |
|---------------|---------------|
| HumanEval     | 7.25          |
| MBPP          | 5.50          |
| LiveCodeBench | 5.50          |
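
As a rough way to read these numbers, the sketch below estimates how many target-model forward passes a 4096-token generation would need. It assumes "accept length" means the average number of tokens committed per target verification pass, which this README does not state explicitly. It also ignores the drafter's own cost, so it is not a speedup estimate. The helper name `estimated_target_passes` is hypothetical.

```python
import math

# Accept lengths copied from the table in this section (block size 16).
ACCEPT_LENGTH = {"HumanEval": 7.25, "MBPP": 5.50, "LiveCodeBench": 5.50}


def estimated_target_passes(num_tokens: int, accept_length: float) -> int:
    """Approximate target forward passes needed to emit `num_tokens` tokens."""
    if accept_length <= 0:
        raise ValueError("accept_length must be positive")
    return math.ceil(num_tokens / accept_length)


if __name__ == "__main__":
    for name, tau in ACCEPT_LENGTH.items():
        print(name, estimated_target_passes(4096, tau))
```

Under that assumption, a 4096-token HumanEval generation would take about 565 target passes instead of 4096 with plain autoregressive decoding.
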
## Acknowledgements
Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.
## Citation
If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).
```bibtex
@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```