# Qwen3-Coder-Next-DFlash
DFlash is a novel speculative decoding method that utilizes a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
This model is the drafter component. It must be used in conjunction with the target model Qwen/Qwen3-Coder-Next or its FP8 variant. It was trained with a context length of 4096 tokens.
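As background, block-wise speculative decoding works like this: the drafter proposes a block of tokens in parallel, the target model verifies the whole block in a single forward pass, and the longest prefix the target agrees with is committed. The toy sketch below illustrates that greedy-verification loop with stand-in functions (`draft_block`, `target_argmax` are placeholders, not the real DFlash or Qwen3-Coder-Next APIs):

```python
def speculative_step(prefix, draft_block, target_argmax, block_size=16):
    """Propose `block_size` tokens with the drafter, then keep the longest
    prefix the target model would also have produced greedily. Returns the
    tokens committed in this step."""
    draft = draft_block(prefix, block_size)      # block drafted in parallel
    accepted = []
    context = list(prefix)
    for tok in draft:
        expected = target_argmax(context)        # target's greedy choice here
        if tok != expected:
            accepted.append(expected)            # replace the first mismatch
            break
        accepted.append(tok)
        context.append(tok)
    else:
        # every draft token accepted: the target contributes one bonus token
        accepted.append(target_argmax(context))
    return accepted

# Tiny demo with deterministic toy "models" over integer tokens:
target = lambda ctx: (ctx[-1] + 1) % 10          # target counts upward
drafter = lambda ctx, k: [(ctx[-1] + i + 1) % 10 for i in range(k - 1)] + [0]
print(speculative_step([1, 2, 3], drafter, target, block_size=4))  # [4, 5, 6, 7]
```

Every committed token except possibly the last is guaranteed to match what the target would have generated autoregressively, which is why speculative decoding preserves output quality.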
## 🚀 Quick Start

### SGLang

#### Installation

```shell
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```
#### Inference

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Coder-Next \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --dtype bfloat16 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --trust-remote-code \
  --mamba-scheduler-strategy extra_buffer \
  --tool-call-parser qwen3_coder
```
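Once the server is up, it can be queried through SGLang's OpenAI-compatible API. A minimal client sketch, assuming the default port 30000 (adjust the URL if you pass `--port` or `--host`); the `build_request` helper is illustrative, not part of SGLang:

```python
import json
import urllib.request

def build_request(prompt, max_tokens=128):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": "Qwen/Qwen3-Coder-Next",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, url="http://127.0.0.1:30000/v1/chat/completions"):
    """Send one chat request to the local SGLang server and return the text."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Write a Python one-liner to reverse a string.")
```

Speculative decoding is transparent to the client: requests and responses are identical to a non-speculative deployment, only latency changes.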
> **Note:** For long-context or agentic usage (such as OpenClaw or Claude Code), consider adding `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the draft model. Because the draft model was trained with only a 4K context, this often improves performance on very long contexts (50K+ tokens).
## Results

- Max new tokens: 4096
- Block size: 16

| Dataset | HumanEval | MBPP | LiveCodeBench |
|---|---|---|---|
| Accept Length | xxx | xxx | xxx |
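Accept length translates into wall-clock speedup roughly as follows (standard speculative-decoding accounting, not a measurement from this card; the numbers below are hypothetical):

```python
def estimated_speedup(accept_length, draft_cost_ratio):
    """Rough per-step speedup estimate for speculative decoding, assuming:
    - accept_length: average tokens committed per verification step
      (accepted draft tokens plus the target's correction/bonus token);
    - draft_cost_ratio: cost of drafting one block relative to one target
      forward pass, so each step costs (1 + ratio) target-forward units.
    Plain autoregressive decoding commits 1 token per target forward pass."""
    return accept_length / (1 + draft_cost_ratio)

# Hypothetical: 5 tokens committed per step, drafting a block costs 10%
# of a target forward pass -> roughly 4.5x throughput.
print(round(estimated_speedup(5.0, 0.10), 2))
```

This is why a lightweight drafter matters: the smaller `draft_cost_ratio` is, the closer the realized speedup gets to the accept length itself.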