Qwen3.5-35B-A3B-DFlash
This model is still under training.
DFlash is a speculative decoding method that uses a lightweight block diffusion model for drafting. Because the drafter proposes a whole block of tokens in parallel rather than one at a time, it enables efficient, high-quality drafting and substantially faster inference.
This model is the drafter component. It must be used in conjunction with the target model Qwen/Qwen3.5-35B-A3B. It was trained with a context length of 4096 tokens.
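To make the draft-then-verify loop concrete, here is a minimal, generic sketch of greedy speculative decoding with toy stand-in models (not the DFlash implementation; the block diffusion drafter would produce the whole block in one pass rather than token by token):

```python
def speculative_step(target_next, draft_next, context, block_size=16):
    """One draft-then-verify step of greedy speculative decoding.

    The draft model proposes `block_size` tokens; the target model can
    score the whole block in a single parallel forward pass and keeps
    the longest prefix that matches its own greedy choice, plus one
    token of its own. `target_next` / `draft_next` map a token context
    to the next token (toy stand-ins for real model calls)."""
    # Draft phase: propose a block of tokens.
    proposal, ctx = [], list(context)
    for _ in range(block_size):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Verify phase: accept the longest matching prefix.
    accepted, ctx = [], list(context)
    for tok in proposal:
        if target_next(ctx) != tok:  # first mismatch rejects the rest
            break
        accepted.append(tok)
        ctx.append(tok)
    # The target always contributes one token, so progress is guaranteed.
    accepted.append(target_next(ctx))
    return accepted
```

The "Accept Length" figures reported below correspond to the average length of `accepted` per step: the more often the drafter agrees with the target, the more tokens each target forward pass yields.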
Quick Start
SGLang
Installation
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
Inference
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-35B-A3B \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
--speculative-num-draft-tokens 16 \
--tp-size 1 \
--dtype bfloat16 \
--attention-backend fa3 \
--mem-fraction-static 0.75 \
--trust-remote-code \
--mamba-scheduler-strategy extra_buffer \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
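Once the server is up, it can be queried through SGLang's OpenAI-compatible chat endpoint. A minimal client sketch, assuming the server runs on SGLang's default port 30000 (adjust if you passed --port):

```python
import json
import urllib.request

# Chat-completions payload for the launched server (OpenAI-compatible API).
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 256,
    "temperature": 0.6,
}

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to send the request (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```

Speculative decoding is transparent to the client: responses are identical to running the target model alone, only faster.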
Note: For long-context or agentic usage (such as OpenClaw or Claude Code), consider adding
--speculative-dflash-draft-window-size WINDOW_SIZE to enable sliding-window attention for the draft model. Because the draft model was trained only on 4K context, this often improves performance on very long contexts (50K+ tokens).
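Sliding-window attention simply restricts each draft position to the most recent tokens, keeping the drafter's effective context near its 4K training length even when the target context is far longer. A minimal sketch of such a mask (illustrative only, not the DFlash kernel):

```python
def sliding_window_mask(seq_len, window):
    """Boolean causal attention mask: position q may attend only to the
    `window` most recent positions k (itself included), i.e. positions
    with q - window < k <= q. Everything older is masked out, so the
    draft model never sees more context than it was trained on."""
    return [
        [q - window < k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With, say, a 4-token sequence and a window of 2, position 3 attends only to positions 2 and 3; the first two tokens are masked.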
Early Results
- Thinking: enabled
- Max new tokens: 4096
- Block size: 16
| Dataset   | Accept Length |
|-----------|---------------|
| GSM8K     | 6.830         |
| Math500   | 7.249         |
| HumanEval | 8.002         |
| MBPP      | 6.425         |
| MT-Bench  | 5.302         |
| Alpaca    | 5.040         |
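Accept length translates almost directly into decoding speedup: each draft-verify step costs one target forward pass plus one (cheap) draft pass and yields that many tokens, versus one token per pass without speculation. A back-of-the-envelope estimate, where the relative cost of a draft pass (draft_cost_ratio) is an assumed figure, not a measured one:

```python
def estimated_speedup(accept_len, draft_cost_ratio=0.1):
    """Rough throughput gain over plain autoregressive decoding.

    Each step costs one target forward pass plus one draft pass
    (assumed here to cost draft_cost_ratio of a target pass) and
    produces `accept_len` tokens on average; the baseline produces
    one token per target pass. Ignores verification-kernel overhead."""
    return accept_len / (1.0 + draft_cost_ratio)

for name, tau in {"GSM8K": 6.830, "HumanEval": 8.002, "Alpaca": 5.040}.items():
    print(f"{name}: ~{estimated_speedup(tau):.1f}x")
```

Real speedups depend on batch size, hardware, and kernel overheads, so treat this purely as intuition for why longer accept lengths matter.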