Qwen3-Coder-30B-A3B-DFlash

Paper (Coming Soon) | GitHub | Blog

DFlash is a speculative decoding method that uses a lightweight block diffusion model as the drafter. The drafter proposes whole blocks of tokens in parallel, enabling efficient, high-quality drafting and substantially faster inference.

This model is the drafter component (a 0.5B-parameter model stored in BF16). It must be used together with the target model Qwen/Qwen3-Coder-30B-A3B-Instruct.

[Figure: DFlash architecture]
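To build intuition for how drafting and verification interact, here is a minimal, self-contained sketch of a speculative decoding loop with a block drafter. Both models are toy stand-ins (nothing here is the actual DFlash or Qwen implementation), and greedy acceptance is assumed:

def draft_block(prefix: list[int], block_size: int) -> list[int]:
    # Toy drafter: proposes the whole block in one shot, analogous to one
    # block-diffusion pass in DFlash. It is deliberately wrong on multiples
    # of 5 so the correction path below gets exercised.
    return [
        0 if (prefix[-1] + i + 1) % 5 == 0 else (prefix[-1] + i + 1) % 100
        for i in range(block_size)
    ]

def target_next(prefix: list[int]) -> int:
    # Toy target model: one "expensive" autoregressive step.
    return (prefix[-1] + 1) % 100

def spec_decode(prompt: list[int], max_new: int, block_size: int = 16) -> list[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        block = draft_block(seq, block_size)
        # Verification: in a real system the target scores the whole drafted
        # block in a single forward pass. Greedy acceptance keeps drafted
        # tokens while they match and substitutes the target's own token at
        # the first mismatch.
        accepted: list[int] = []
        for tok in block:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # the target's correction token
                break
            accepted.append(tok)
        seq.extend(accepted)
    return seq[: len(prompt) + max_new]

print(spec_decode([1, 2, 3], max_new=8, block_size=4))

Because the drafter emits a whole block per step while the target only needs one verification pass per block, the number of sequential target steps shrinks roughly in proportion to the acceptance length.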

πŸ“Š Training Data & Efficiency

Qwen3-Coder-30B-A3B-DFlash is trained on 289K samples.

For comparison, lmsys/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct-SpecForge is trained on the open-perfect-blend dataset with 1.4M samples, nearly 5× more data than DFlash. Despite being trained on significantly less data, DFlash already outperforms EAGLE-3 in inference acceleration.

This result highlights the training efficiency and scalability of DFlash, and suggests that further scaling the training data can unlock even greater acceleration gains.

πŸš€ Quick Start

SGLang

DFlash is now supported in SGLang; vLLM integration is in progress.

Installation

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"

Inference

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
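Once the server is running, it exposes an OpenAI-compatible API. A minimal client sketch, assuming SGLang's default port 30000 (adjust if you pass --port):

# Query the server through SGLang's OpenAI-compatible endpoint.
# Assumes the launch command above with SGLang's default port (30000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Implement bubble sort in Python."}],
    temperature=0.0,
    max_tokens=512,
)
print(response.choices[0].message.content)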

Transformers

Installation

pip install transformers==4.57.3 torch==2.9.0 accelerate

Inference

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the DFlash drafter (custom modeling code, hence trust_remote_code).
model = AutoModel.from_pretrained(
    "z-lab/Qwen3-Coder-30B-A3B-DFlash",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()

# Load the target model that verifies the drafted tokens.
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    dtype="auto",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")
# Register the mask token used by the block diffusion drafter.
tokenizer.add_special_tokens({"mask_token": "<|MASK|>"})

prompt = "Please provide a Python implementation of the Bubble Sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Speculative generation: the drafter proposes blocks, the target verifies.
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    mask_token_id=tokenizer.mask_token_id,
    stop_token_ids=[tokenizer.eos_token_id],
)

print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
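Here, spec_generate is a custom generation method shipped with the DFlash model code (which is why trust_remote_code=True is required): the drafter proposes blocks of tokens, presumably using the registered <|MASK|> token as the placeholder for not-yet-decoded positions, and the model passed via target= verifies them.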

Evaluation

DFlash consistently achieves higher speedups than the state-of-the-art speculative decoding method EAGLE-3. All experiments are conducted using SGLang on a single H200 GPU.

For EAGLE-3, we use the speculative decoding configuration that gives it the best speedup:

  • --speculative-num-steps 7
  • --speculative-eagle-topk 1
  • --speculative-num-draft-tokens 8

For DFlash, we use a block size of 16 during speculation.

We compare against the EAGLE-3 checkpoint lmsys/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct-SpecForge, which is trained on 1.4M samples. In contrast, DFlash is trained on only 289K samples yet still delivers superior acceleration, highlighting its training efficiency; we expect that further scaling the training data will yield even larger speedups.
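For reference, the EAGLE-3 baseline corresponds to a launch command along the following lines (a sketch assembled from the settings above; flag spelling follows SGLang's EAGLE3 support and may differ across versions):

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path lmsys/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct-SpecForge \
    --speculative-num-steps 7 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 8 \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code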

LiveCodeBench

Batch Size   Method           Output Throughput (tokens/s)   Acceptance Length   Speedup vs. AR
1            Autoregressive     226.39                        1.00                1.00×
1            EAGLE-3            435.63                        4.49                1.92×
1            DFlash             561.71                        5.89                2.48×
4            Autoregressive     610.80                        1.00                1.00×
4            EAGLE-3           1205.69                        4.49                1.97×
4            DFlash            1538.73                        5.91                2.52×
8            Autoregressive     948.98                        1.00                1.00×
8            EAGLE-3           1936.89                        4.50                2.04×
8            DFlash            2591.36                        5.88                2.73×
16           Autoregressive    1456.78                        1.00                1.00×
16           EAGLE-3           3175.19                        4.50                2.18×
16           DFlash            4073.23                        5.92                2.80×
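Here and in the HumanEval table below, speedup is the output throughput divided by the autoregressive baseline at the same batch size; e.g., at batch size 1, 561.71 / 226.39 ≈ 2.48×.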

HumanEval

Batch Size   Method           Output Throughput (tokens/s)   Acceptance Length   Speedup vs. AR
1            Autoregressive     226.75                        1.00                1.00×
1            EAGLE-3            497.21                        5.33                2.19×
1            DFlash             658.93                        7.29                2.91×
4            Autoregressive     595.48                        1.00                1.00×
4            EAGLE-3           1299.63                        5.32                2.18×
4            DFlash            1695.39                        7.30                2.85×
8            Autoregressive     899.91                        1.00                1.00×
8            EAGLE-3           1980.40                        5.34                2.20×
8            DFlash            2835.01                        7.37                3.15×
16           Autoregressive    1362.89                        1.00                1.00×
16           EAGLE-3           3135.44                        5.32                2.30×
16           DFlash            4301.11                        7.36                3.16×

Acknowledgement

We are grateful to Yotta Labs for their compute support in training this draft model.

Citation

If you find DFlash useful for your research or applications, please cite our project. The full paper is coming soon!

@article{chen2026dflash,
  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author  = {Chen, Jian and Liu, Zhijian},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://github.com/z-lab/dflash},
  note    = {Paper coming soon}
}