---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- dflash
- speculative-decoding
- block-diffusion
- draft-model
- efficiency
- qwen
- diffusion-language-model
---

# Qwen3-Coder-Next-DFlash

[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)

**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next).

<div align="center">
  <img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
</div>

## Quick Start

### Installation

```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
```

### Launch Server

Use `--speculative-num-draft-tokens` to set the block size (8 or **16**).

```bash
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_DFLASH_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
```

> **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
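
To show how the optional window flag from the tip combines with the launch command, here is a minimal sketch that assembles the same argument list in Python. The helper name `build_launch_command` and the example window value `4096` are illustrative assumptions, not part of SGLang or this repository; every flag is copied from the command above. The three `SGLANG_ENABLE_*` environment variables still need to be exported before the command is run.

```python
import shlex
from typing import Optional


def build_launch_command(block_size: int = 16,
                         draft_window_size: Optional[int] = None) -> list[str]:
    """Assemble the launch command shown above (hypothetical helper)."""
    # This README lists 8 and 16 as the block sizes for this drafter.
    if block_size not in (8, 16):
        raise ValueError("block_size must be 8 or 16")
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen3-Coder-Next",
        "--speculative-algorithm", "DFLASH",
        "--speculative-draft-model-path", "z-lab/Qwen3-Coder-Next-DFlash",
        "--speculative-num-draft-tokens", str(block_size),
        "--tp-size", "1",
        "--attention-backend", "fa3",
        "--mem-fraction-static", "0.75",
        "--mamba-scheduler-strategy", "extra_buffer",
        "--trust-remote-code",
    ]
    # Only add the sliding-window flag when a window size is requested.
    if draft_window_size is not None:
        cmd += ["--speculative-dflash-draft-window-size", str(draft_window_size)]
    return cmd


if __name__ == "__main__":
    # 4096 is an illustrative value only; choose one that suits your workload.
    print(shlex.join(build_launch_command(draft_window_size=4096)))
```

The returned list can be passed to `subprocess.run` or printed and pasted into a shell.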

### Usage

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```
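
To check whether speculative decoding helps on your own prompts, you can time a request and divide by the number of generated tokens. The sketch below is one way to do that; `measure_throughput` is a hypothetical helper, and it assumes the server fills in `response.usage.completion_tokens` for non-streaming requests. The measured time covers the whole request (prefill, decoding, and network overhead), so treat the result as end-to-end throughput rather than pure decoding speed.

```python
import time


def measure_throughput(client, prompt, model="Qwen/Qwen3-Coder-Next", max_tokens=4096):
    """Send one chat request; return (completion_tokens, seconds, tokens_per_second)."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    # Assumption: the server reports token usage for non-streaming requests.
    completion_tokens = response.usage.completion_tokens
    tokens_per_second = completion_tokens / elapsed if elapsed > 0 else float("inf")
    return completion_tokens, elapsed, tokens_per_second


# Example (requires the server from "Launch Server" and the `openai` package):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
#   tokens, seconds, tps = measure_throughput(client, "Write a quicksort in Python.")
#   print(f"{tokens} tokens in {seconds:.2f}s ({tps:.1f} tok/s)")
```

Running the same prompts against a server launched without the speculative flags gives a baseline for comparison.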

### vLLM

Community-contributed support is available. See PRs [#36847](https://github.com/vllm-project/vllm/pull/36847) and [#36767](https://github.com/vllm-project/vllm/pull/36767) for details.

## Acceptance Length

- Max new tokens: 4096
- Block size: 16

| Dataset   | Accept Length |
|-----------|---------------|
| HumanEval | 7.25 |
| MBPP      | 5.50 |
| LiveCodeBench  | 5.50 |

## Acknowledgements

Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.

## Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).

```bibtex
@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```