---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dflash
- speculative-decoding
- block-diffusion
- draft-model
- efficiency
- qwen
- diffusion-language-model
---

# Qwen3-Coder-Next-DFlash

[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)

**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This repository contains the draft model, which must be paired with the target model [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next).

<div align="center">
<img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
</div>
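The draft-and-verify loop behind speculative decoding can be sketched with toy token sequences. This is an illustrative greedy-acceptance sketch, not DFlash's actual implementation; `speculative_step` and the token values are made up for the example:

```python
# Toy sketch of the speculative decoding accept/verify loop (greedy case).
# The draft model proposes a block of tokens in parallel; the target model
# verifies them in one forward pass and keeps the longest agreeing prefix.

def speculative_step(draft_block, target_preds):
    """Accept the longest prefix of draft_block that the target agrees with,
    then append the target's correction at the first mismatch."""
    accepted = []
    for d, t in zip(draft_block, target_preds):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction ends the block
            break
    return accepted

# Example: draft guesses 5 tokens, target agrees on the first 3.
draft = [10, 11, 12, 99, 13]
target = [10, 11, 12, 42, 7]
print(speculative_step(draft, target))  # -> [10, 11, 12, 42]
```

The longer the accepted prefix per step, the fewer target forward passes are needed per generated token, which is where the speedup comes from.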

## Quick Start

### Installation

Install SGLang from the pull-request branch that adds DFlash support:

```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```

### Launch Server

Use `--speculative-num-draft-tokens` to set the block size (8 or **16**).

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
```

> **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter. Because the draft model is trained on a 4K context, this often improves performance on very long contexts (50K+ tokens).
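As intuition for what the window does: each drafter position attends only to the most recent `window` positions instead of the full prefix. A toy mask sketch (illustrative only, not sglang's actual implementation):

```python
# Toy causal sliding-window attention mask: position i may attend only to
# positions j with i - window < j <= i. This illustrates the idea behind
# restricting the drafter's attention span; not sglang's implementation.

def sliding_window_mask(seq_len, window):
    return [
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in sliding_window_mask(5, 3):
    print(row)
# the last position attends only to the final 3 positions: [0, 0, 1, 1, 1]
```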

### Usage

The server exposes an OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

### vLLM

Community-contributed support is available. See PRs [#36847](https://github.com/vllm-project/vllm/pull/36847) and [#36767](https://github.com/vllm-project/vllm/pull/36767) for details.
+
## Acceptance Length
|
| 72 |
+
|
| 73 |
- Max new tokens: 4096
|
| 74 |
- Block size: 16
|
| 75 |
+
-
|
| 76 |
| Dataset | Accept Length |
|
| 77 |
|-----------|---------------|
|
| 78 |
+
| HumanEval | 7.25 |
|
| 79 |
+
| MBPP | 5.50 |
|
| 80 |
+
| LiveCodeBench | 5.50 |
|