
Qwen3-Coder-Next-DFlash

Paper | GitHub | Blog

DFlash is a speculative decoding method that uses a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This model is the drafter component and must be used together with the target model Qwen/Qwen3-Coder-Next or its FP8 variant. It was trained with a context length of 4096 tokens.
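The draft-then-verify idea behind speculative decoding can be illustrated with a toy sketch (this is an illustration of greedy verification in general, not DFlash's actual block-diffusion drafter): the drafter proposes a block of tokens in one shot, and the target model accepts the longest prefix that matches its own predictions, substituting its own token at the first mismatch.

```python
def verify_block(draft_tokens, target_tokens):
    """Toy greedy verification for speculative decoding.

    Accept the longest prefix of the draft block that agrees with the
    target model's tokens; at the first mismatch, commit the target's
    token instead and stop. Every committed token is therefore exactly
    what the target model would have produced on its own.
    """
    committed = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft == target:
            committed.append(draft)  # draft token verified, keep it
        else:
            committed.append(target)  # target's correction ends the block
            break
    return committed


# Example: 2 draft tokens accepted, then the target corrects token 3.
print(verify_block([1, 2, 3, 4], [1, 2, 9, 4]))  # [1, 2, 9]
```

Because the drafter proposes a whole block in parallel, the target model can commit several tokens per forward pass instead of one, which is where the speedup comes from.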

DFlash Architecture

🚀 Quick Start

SGLang

Installation

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"

Inference

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-Next-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code \
    --mamba-scheduler-strategy extra_buffer \
    --tool-call-parser qwen3_coder

Note: For long-context or agentic usage (such as OpenClaw or Claude Code), consider adding --speculative-dflash-draft-window-size WINDOW_SIZE to enable sliding-window attention for the draft model. Because the draft model was trained only on a 4K context, this often improves performance on very long contexts (50K+ tokens).
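Once the server is up, it exposes SGLang's OpenAI-compatible API. A minimal client sketch using only the standard library is shown below; the base URL assumes SGLang's default port 30000, and the model name must match the served target model:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256) -> bytes:
    # Payload for the OpenAI-compatible /v1/chat/completions endpoint.
    # The model field names the target model; speculative decoding with
    # the DFlash drafter happens transparently on the server side.
    payload = {
        "model": "Qwen/Qwen3-Coder-Next",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode()

def query(prompt: str, base_url: str = "http://localhost:30000") -> str:
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client (e.g. the openai Python package pointed at the same base URL) works equally well.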

Results

  • Max new tokens: 4096
  • Block size: 16
| Dataset       | Accept Length |
|---------------|---------------|
| HumanEval     | xxx           |
| MBPP          | xxx           |
| LiveCodeBench | xxx           |
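Accept length is the average number of tokens committed per target-model forward pass, so it upper-bounds the decoding speedup. A rough back-of-the-envelope conversion (the overhead parameter is a hypothetical placeholder for per-step drafting cost, not a measured number):

```python
def expected_speedup(accept_length: float, draft_overhead: float = 0.1) -> float:
    """Rough speedup estimate from accept length.

    Plain autoregressive decoding commits 1 token per target forward
    pass; speculative decoding commits `accept_length` tokens on average,
    discounted by the drafter's relative per-step overhead (a made-up
    10% default here).
    """
    return accept_length / (1.0 + draft_overhead)


# With zero drafting overhead, speedup equals the accept length itself.
print(expected_speedup(3.0, draft_overhead=0.0))  # 3.0
```

Real-world speedups also depend on batch size, hardware, and the draft window configuration, so treat this purely as intuition for why accept length is the headline metric.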
Model size: 0.5B parameters (Safetensors, BF16)