jianchen0311 committed · verified
Commit 4a62e0d · 1 parent: 7b35a5b

Update README.md

Files changed (1): README.md (+40 -18)
README.md CHANGED
````diff
@@ -3,35 +3,37 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: text-generation
 tags:
+- dflash
 - speculative-decoding
-- diffusion
+- block-diffusion
+- draft-model
 - efficiency
-- flash-decoding
 - qwen
 - diffusion-language-model
 ---
 
 # Qwen3-Coder-Next-DFlash
-[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
 
-**DFlash** is a novel speculative decoding method that utilizes a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
+[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
 
-This model is the **drafter** component. It must be used in conjunction with the target model `Qwen/Qwen3-Coder-Next` or its FP8 variant. It was trained with a context length of 4096 tokens.
+**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next).
 
 <div align="center">
-<img src="assets/dflash_system.png" alt="DFlash Architecture" width="100%">
+<img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
 </div>
 
-## 🚀 Quick Start
+## Quick Start
 
-### SGLang
+### Installation
 
-#### Installation
 ```bash
 uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
 ```
 
-#### Inference
+### Launch Server
+
+Use `--speculative-num-draft-tokens` to set the block size (8 or **16**).
+
 ```bash
 python -m sglang.launch_server \
 --model-path Qwen/Qwen3-Coder-Next \
@@ -39,20 +41,40 @@ python -m sglang.launch_server \
 --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
 --speculative-num-draft-tokens 16 \
 --tp-size 1 \
---dtype bfloat16 \
 --attention-backend fa3 \
 --mem-fraction-static 0.75 \
---trust-remote-code \
 --mamba-scheduler-strategy extra_buffer \
---tool-call-parser qwen3_coder
+--trust-remote-code
 ```
-> **Note:** For long-context or agentic usage (such as OpenClaw or Claude Code), consider adding `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the draft model. Because the draft model is only trained on 4K context, this often improves performance on very long context (50K+ tokens).
 
-#### Results
+> **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter.
+
+### Usage
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-Coder-Next",
+    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+    max_tokens=4096,
+)
+print(response.choices[0].message.content)
+```
+
+### vLLM
+
+Community-contributed support is available. See PRs [#36847](https://github.com/vllm-project/vllm/pull/36847) and [#36767](https://github.com/vllm-project/vllm/pull/36767) for details.
+
+## Acceptance Length
+
 - Max new tokens: 4096
 - Block size: 16
+
 | Dataset | Accept Length |
 |-----------|---------------|
-| HumanEval | xxx |
-| MBPP | xxx |
-| LiveCodeBench | xxx |
+| HumanEval | 7.25 |
+| MBPP | 5.50 |
+| LiveCodeBench | 5.50 |
````
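The acceptance-length numbers in the updated table count how many of the drafted tokens the target model keeps per verification step. A minimal toy sketch of that accept check under greedy verification (hypothetical token lists, not the DFlash or SGLang implementation):

```python
# Toy sketch of speculative-decoding acceptance under greedy verification.
# The token lists are made-up illustrations, not real model outputs.

def accepted_prefix(draft_tokens, target_tokens):
    """Count how many drafted tokens match the target's greedy choices,
    stopping at the first mismatch (the rest of the block is discarded)."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

# One drafted block of 16 tokens vs. what the target would have produced.
draft  = [3, 7, 7, 2, 9, 1, 4, 4, 8, 0, 5, 5, 6, 2, 3, 1]
target = [3, 7, 7, 2, 9, 1, 4, 9, 8, 0, 5, 5, 6, 2, 3, 1]
print(accepted_prefix(draft, target))  # 7 — first mismatch is at position 8
```

Averaging this count over many verification steps gives the per-dataset accept length reported above.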