RedHatAI/Qwen3-8B-speculator.dflash

This is a DFlash speculator model for Qwen/Qwen3-8B.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning). Training compute for this model was sponsored by Modal.

Commands

The commands below use the Speculators library and the helper scripts provided in its repository.

Prepare data

# In virtual environment with speculators installed
python scripts/prepare_data.py \
  --model Qwen/Qwen3-8B \
  --data ./regenerated_data.jsonl \
  --output ./output \
  --seq-length 8192
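The expected schema of regenerated_data.jsonl is not documented in this card. A plausible record, assuming the common chat-style "messages" JSONL layout (the field names and contents here are an assumption, not the script's confirmed format):

```python
import json

# Hypothetical regenerated_data.jsonl record; the "messages" schema is an
# assumption based on common chat-dataset conventions, not this card.
record = {
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."},
        {"role": "assistant", "content": "A small draft model proposes tokens that the target model verifies in parallel."},
    ]
}

# One JSON object per line is what the .jsonl extension implies.
line = json.dumps(record)
parsed = json.loads(line)
```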

Launch vLLM

# In (separate) virtual environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
  Qwen/Qwen3-8B \
  --target-layer-ids 2 10 18 26 34 \
  -- --port 8000 \
  --gpu-memory-utilization 0.9 \
  --disable-uvicorn-access-log \
  --tensor-parallel-size 1 \
  --data-parallel-size 2
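Training must not start until this server is up. A minimal readiness poll against vLLM's standard /health endpoint (the URL, timeout, and interval values here are illustrative):

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(base_url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll vLLM's /health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet; retry after a pause
        time.sleep(interval)
    return False
```

Call wait_for_vllm("http://localhost:8000") before launching torchrun in the next step.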

Launch training

Run this only after vLLM has finished launching and is serving in the background.

# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
  --standalone \
  --nproc_per_node 2 \
  scripts/train.py \
  --verifier-name-or-path Qwen/Qwen3-8B \
  --speculator-type dflash \
  --num-layers 5 \
  --data-path ./output \
  --vllm-endpoint http://localhost:8000/v1 \
  --save-path ./output/checkpoints \
  --epochs 3 \
  --lr 0.0006 \
  --total-seq-len 8192 \
  --on-missing generate \
  --on-generate delete \
  --seed 42 \
  --log-freq 100 \
  --draft-vocab-size 32000 \
  --draft-arch qwen3 \
  --target-layer-ids 2 10 18 26 34 \
  --draft-hidden-act silu \
  --scheduler-type cosine \
  --max-anchors 3072 \
  --prefetch-factor 2 \
  --num-workers 8

Model Specifications

Base Model: Qwen/Qwen3-8B
Chat Template: Qwen/Qwen3-8B (use the /chat/completions endpoint)
Format: Safetensors
Model Size: ~2B parameters
License: Apache 2.0
Validation Hardware: NVIDIA H100

Deployment

# Install vLLM from the required PR
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

# Deploy with speculative decoding
vllm serve Qwen/Qwen3-8B \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --speculative-config '{
        "model": "RedHatAI/Qwen3-8B-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'
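Once the server is up, requests go through the OpenAI-compatible /v1/chat/completions endpoint so the Qwen3 chat template is applied server-side; speculative decoding is transparent to the client. A minimal request sketch (the prompt and max_tokens value are illustrative):

```python
import json
import urllib.request

# A plain chat-completions payload; no speculative-decoding fields are needed
# on the client side, since the speculator is configured at serve time.
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # default `vllm serve` port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # uncomment with a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```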

Preliminary Evaluations

Per-position token acceptance rates across datasets (with reasoning enabled):

Dataset         Pos 1   Pos 2   Pos 3   Pos 4   Pos 5   Pos 6   Pos 7   Avg. Accepted Length
HumanEval       79.9%   58.0%   40.3%   27.0%   17.8%   11.3%   6.8%    3.410
math_reasoning  82.2%   62.7%   46.2%   33.5%   23.4%   15.8%   9.9%    3.740
qa              68.9%   42.6%   25.0%   14.4%   8.1%    4.4%    2.3%    2.660
question        73.0%   47.6%   30.1%   18.9%   11.7%   7.1%    4.1%    2.930
rag             71.1%   44.8%   27.0%   15.7%   8.9%    4.9%    2.5%    2.750
summarization   65.5%   36.1%   19.0%   9.5%    4.7%    2.3%    1.1%    2.380
tool_call       71.3%   44.6%   25.8%   14.4%   7.8%    4.1%    2.1%    2.700
translation     63.8%   38.4%   22.1%   11.8%   6.1%    3.2%    1.5%    2.470
writing         73.2%   47.7%   30.1%   18.9%   11.8%   7.2%    4.2%    2.930
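The average-length column is consistent with one plus the sum of the per-position acceptance rates: the verifier always emits one token per step, and each accepted draft position contributes one more. A quick check on the HumanEval row:

```python
# Per-position acceptance rates for HumanEval (positions 1-7), from the table above.
rates = [0.799, 0.580, 0.403, 0.270, 0.178, 0.113, 0.068]

# Expected tokens per verification step: one guaranteed token from the
# verifier, plus the expected number of accepted draft tokens.
avg_length = 1 + sum(rates)
print(round(avg_length, 3))  # -> 3.411, matching the table's 3.410
```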

References

Paper: DFlash: Block Diffusion for Flash Speculative Decoding
