---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dflash
- speculative-decoding
- block-diffusion
- draft-model
- efficiency
- qwen
- diffusion-language-model
---

# Qwen3-Coder-Next-DFlash

[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)

**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model to draft multiple tokens in parallel. This repository contains the draft model, which must be paired with the target model [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next).

<div align="center">
<img src="assets/dflash_system.png" alt="DFlash Architecture" width="85%">
</div>
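The draft-and-verify loop behind speculative decoding can be sketched with toy token sequences. This is an illustrative greedy-acceptance sketch, not DFlash's actual implementation; `speculative_step` and the token values are made up for the example:

```python
# Toy sketch of the speculative decoding accept/verify loop (greedy case).
# The draft model proposes a block of tokens in parallel; the target model
# verifies them in one forward pass and keeps the longest agreeing prefix.

def speculative_step(draft_block, target_preds):
    """Accept the longest prefix of draft_block that the target agrees with,
    then append the target's correction at the first mismatch."""
    accepted = []
    for d, t in zip(draft_block, target_preds):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction ends the block
            break
    return accepted

# Example: draft guesses 5 tokens, target agrees on the first 3.
draft = [10, 11, 12, 99, 13]
target = [10, 11, 12, 42, 7]
print(speculative_step(draft, target))  # -> [10, 11, 12, 42]
```

The longer the accepted prefix per step, the fewer target forward passes are needed per generated token, which is where the speedup comes from.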

## Quick Start

### Installation

Install SGLang from the pull-request branch that adds DFlash support:

```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```

### Launch Server

Use `--speculative-num-draft-tokens` to set the block size (8 or **16**).

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code
```

> **Tip:** For long-context or agentic workloads, add `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the drafter. Because the draft model is trained on a 4K context, this often improves performance on very long contexts (50K+ tokens).
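As intuition for what the window does: each drafter position attends only to the most recent `window` positions instead of the full prefix. A toy mask sketch (illustrative only, not sglang's actual implementation):

```python
# Toy causal sliding-window attention mask: position i may attend only to
# positions j with i - window < j <= i. This illustrates the idea behind
# restricting the drafter's attention span; not sglang's implementation.

def sliding_window_mask(seq_len, window):
    return [
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in sliding_window_mask(5, 3):
    print(row)
# the last position attends only to the final 3 positions: [0, 0, 1, 1, 1]
```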

### Usage

The server exposes an OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

### vLLM

Community-contributed support is available. See PRs [#36847](https://github.com/vllm-project/vllm/pull/36847) and [#36767](https://github.com/vllm-project/vllm/pull/36767) for details.
+
## Acceptance Length
|
| 72 |
+
|
| 73 |
- Max new tokens: 4096
|
| 74 |
- Block size: 16
|
| 75 |
+
-
|
| 76 |
| Dataset | Accept Length |
|
| 77 |
|-----------|---------------|
|
| 78 |
+
| HumanEval | 7.25 |
|
| 79 |
+
| MBPP | 5.50 |
|
| 80 |
+
| LiveCodeBench | 5.50 |
|