RedHatAI/Qwen3-8B-speculator.dflash

This is a DFlash speculator model for Qwen/Qwen3-8B.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning). Training compute for this model was sponsored by Modal.

Commands

The commands below use the Speculators library and the helper scripts provided in its repository.

Prepare data

# In virtual environment with speculators installed
python scripts/prepare_data.py \
  --model Qwen/Qwen3-8B \
  --data ./regenerated_data.jsonl \
  --output ./output \
  --seq-length 8192
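The expected schema of regenerated_data.jsonl is not documented in this card. A plausible record, assuming the common chat-style "messages" JSONL layout (the field names and contents here are an assumption, not the script's confirmed format):

```python
import json

# Hypothetical regenerated_data.jsonl record; the "messages" schema is an
# assumption based on common chat-dataset conventions, not this card.
record = {
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."},
        {"role": "assistant", "content": "A small draft model proposes tokens that the target model verifies in parallel."},
    ]
}

# One JSON object per line is what the .jsonl extension implies.
line = json.dumps(record)
parsed = json.loads(line)
```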

Launch vLLM

# In (separate) virtual environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
  Qwen/Qwen3-8B \
  --target-layer-ids 2 10 18 26 34 \
  -- --port 8000 \
  --gpu-memory-utilization 0.9 \
  --disable-uvicorn-access-log \
  --tensor-parallel-size 1 \
  --data-parallel-size 2
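Training must not start until this server is up. A minimal readiness poll against vLLM's standard /health endpoint (the URL, timeout, and interval values here are illustrative):

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(base_url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll vLLM's /health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet; retry after a pause
        time.sleep(interval)
    return False
```

Call wait_for_vllm("http://localhost:8000") before launching torchrun in the next step.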

Launch training

Run this only after vLLM has finished launching and is serving in the background.

# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
  --standalone \
  --nproc_per_node 2 \
  scripts/train.py \
  --verifier-name-or-path Qwen/Qwen3-8B \
  --speculator-type dflash \
  --num-layers 5 \
  --data-path ./output \
  --vllm-endpoint http://localhost:8000/v1 \
  --save-path ./output/checkpoints \
  --epochs 3 \
  --lr 0.0006 \
  --total-seq-len 8192 \
  --on-missing generate \
  --on-generate delete \
  --seed 42 \
  --log-freq 100 \
  --draft-vocab-size 32000 \
  --draft-arch qwen3 \
  --target-layer-ids 2 10 18 26 34 \
  --draft-hidden-act silu \
  --scheduler-type cosine \
  --max-anchors 3072 \
  --prefetch-factor 2 \
  --num-workers 8

Model Specifications

Base Model: Qwen/Qwen3-8B
Chat Template: Qwen/Qwen3-8B (use the /chat/completions endpoint)
Format: Safetensors
Model Size: ~2B parameters
License: Apache 2.0
Validation Hardware: NVIDIA H100

Deployment

# Install vLLM from the required PR
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

# Deploy with speculative decoding
vllm serve Qwen/Qwen3-8B \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --speculative-config '{
        "model": "RedHatAI/Qwen3-8B-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'
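Once the server is up, requests go through the OpenAI-compatible /v1/chat/completions endpoint so the Qwen3 chat template is applied server-side; speculative decoding is transparent to the client. A minimal request sketch (the prompt and max_tokens value are illustrative):

```python
import json
import urllib.request

# A plain chat-completions payload; no speculative-decoding fields are needed
# on the client side, since the speculator is configured at serve time.
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # default `vllm serve` port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # uncomment with a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```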

Preliminary Evaluations

Per-position token acceptance rates across datasets (with reasoning enabled):

Dataset         Pos 1   Pos 2   Pos 3   Pos 4   Pos 5   Pos 6   Pos 7   Avg. Accepted Length
HumanEval       79.9%   58.0%   40.3%   27.0%   17.8%   11.3%   6.8%    3.410
math_reasoning  82.2%   62.7%   46.2%   33.5%   23.4%   15.8%   9.9%    3.740
qa              68.9%   42.6%   25.0%   14.4%   8.1%    4.4%    2.3%    2.660
question        73.0%   47.6%   30.1%   18.9%   11.7%   7.1%    4.1%    2.930
rag             71.1%   44.8%   27.0%   15.7%   8.9%    4.9%    2.5%    2.750
summarization   65.5%   36.1%   19.0%   9.5%    4.7%    2.3%    1.1%    2.380
tool_call       71.3%   44.6%   25.8%   14.4%   7.8%    4.1%    2.1%    2.700
translation     63.8%   38.4%   22.1%   11.8%   6.1%    3.2%    1.5%    2.470
writing         73.2%   47.7%   30.1%   18.9%   11.8%   7.2%    4.2%    2.930
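The average-length column is consistent with one plus the sum of the per-position acceptance rates: the verifier always emits one token per step, and each accepted draft position contributes one more. A quick check on the HumanEval row:

```python
# Per-position acceptance rates for HumanEval (positions 1-7), from the table above.
rates = [0.799, 0.580, 0.403, 0.270, 0.178, 0.113, 0.068]

# Expected tokens per verification step: one guaranteed token from the
# verifier, plus the expected number of accepted draft tokens.
avg_length = 1 + sum(rates)
print(round(avg_length, 3))  # -> 3.411, matching the table's 3.410
```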

References

Paper: DFlash: Block Diffusion for Flash Speculative Decoding
