RedHatAI/Qwen3-8B-speculator.dflash
This is a DFlash speculator model for Qwen/Qwen3-8B.
Training Details
This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning). Training compute for this model was sponsored by Modal.
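The regeneration pipeline itself is not included in this card; as a rough sketch of that step, the snippet below regenerates a single response with Qwen3-8B in reasoning mode via transformers. The prompt, decoding settings, and batching are illustrative assumptions, not the exact setup used.

```python
# Hedged sketch: regenerate one dataset response with Qwen3-8B, reasoning enabled.
# Prompt, decoding settings, and batching are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the benefits of speculative decoding."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Qwen3 reasoning mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
# The regenerated response (including the reasoning trace) replaces the original
# answer in the training record before the data preparation step below.
```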
Commands
The following commands use the Speculators library and the helper scripts provided in its repository.
Prepare data
# In virtual environment with speculators installed
python scripts/prepare_data.py \
--model Qwen/Qwen3-8B \
--data ./regenerated_data.jsonl \
--output ./output \
--seq-length 8192
Launch vLLM
# In a separate virtual environment with vllm installed
CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py \
Qwen/Qwen3-8B \
--target-layer-ids 2 10 18 26 34 \
-- --port 8000 \
--gpu-memory-utilization 0.9 \
--disable-uvicorn-access-log \
--tensor-parallel-size 1 \
--data-parallel-size 2
Launch training
Must be run once vLLM has finished launching and is running in the background.
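One simple way to confirm the server is ready is to poll its OpenAI-compatible /v1/models endpoint before starting torchrun; a minimal sketch, assuming the default localhost:8000 address configured above:

```python
# Minimal readiness check (sketch): wait until the vLLM server launched above
# answers on its OpenAI-compatible /v1/models endpoint.
import time
import urllib.error
import urllib.request

def wait_for_vllm(url: str = "http://localhost:8000/v1/models", timeout_s: int = 600) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(5)
    raise TimeoutError(f"vLLM server at {url} did not become ready in {timeout_s}s")

wait_for_vllm()
```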
# In virtual environment with speculators installed
CUDA_VISIBLE_DEVICES=2,3 torchrun \
--standalone \
--nproc_per_node 2 \
scripts/train.py \
--verifier-name-or-path Qwen/Qwen3-8B \
--speculator-type dflash \
--num-layers 5 \
--data-path ./output \
--vllm-endpoint http://localhost:8000/v1 \
--save-path ./output/checkpoints \
--epochs 3 \
--lr 0.0006 \
--total-seq-len 8192 \
--on-missing generate \
--on-generate delete \
--seed 42 \
--log-freq 100 \
--draft-vocab-size 32000 \
--draft-arch qwen3 \
--target-layer-ids 2 10 18 26 34 \
--draft-hidden-act silu \
--scheduler-type cosine \
--max-anchors 3072 \
--prefetch-factor 2 \
--num-workers 8
Model Specifications
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B |
| Chat Template | Qwen/Qwen3-8B (use the /chat/completions endpoint) |
| Format | Safetensors |
| License | Apache 2.0 |
| Validation Hardware | NVIDIA H100 |
Deployment
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head
# Deploy with speculative decoding
vllm serve Qwen/Qwen3-8B \
--tensor-parallel-size 1 \
--max-model-len 16384 \
--speculative-config '{
"model": "RedHatAI/Qwen3-8B-speculator.dflash",
"num_speculative_tokens": 7,
"method": "dflash"
}'
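Once the server is up, requests go through the standard chat completions API (as noted in the chat template row above); a minimal sketch with the openai Python client, assuming the default localhost:8000 address:

```python
# Query the deployed model through the /chat/completions endpoint.
# Assumes `vllm serve` is listening on the default localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Write a haiku about speculative decoding."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```

Speculative decoding with the DFlash draft model is applied server-side; the client request is unchanged.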
Preliminary Evaluations
Per-position token acceptance rates across datasets (with reasoning enabled):
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---|---|---|---|---|---|---|---|---|
| HumanEval | 79.9% | 58.0% | 40.3% | 27.0% | 17.8% | 11.3% | 6.8% | 3.410 |
| math_reasoning | 82.2% | 62.7% | 46.2% | 33.5% | 23.4% | 15.8% | 9.9% | 3.740 |
| qa | 68.9% | 42.6% | 25.0% | 14.4% | 8.1% | 4.4% | 2.3% | 2.660 |
| question | 73.0% | 47.6% | 30.1% | 18.9% | 11.7% | 7.1% | 4.1% | 2.930 |
| rag | 71.1% | 44.8% | 27.0% | 15.7% | 8.9% | 4.9% | 2.5% | 2.750 |
| summarization | 65.5% | 36.1% | 19.0% | 9.5% | 4.7% | 2.3% | 1.1% | 2.380 |
| tool_call | 71.3% | 44.6% | 25.8% | 14.4% | 7.8% | 4.1% | 2.1% | 2.700 |
| translation | 63.8% | 38.4% | 22.1% | 11.8% | 6.1% | 3.2% | 1.5% | 2.470 |
| writing | 73.2% | 47.7% | 30.1% | 18.9% | 11.8% | 7.2% | 4.2% | 2.930 |
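For reference, the Avg Length column appears consistent with one token emitted by the verifier per step plus the expected number of accepted draft tokens (the sum of the per-position acceptance rates); a quick arithmetic check against the HumanEval row:

```python
# Sanity check (sketch): expected accepted length per verification step
# = 1 (verifier-emitted token) + sum of per-position acceptance rates.
humaneval_rates = [0.799, 0.580, 0.403, 0.270, 0.178, 0.113, 0.068]
expected_length = 1 + sum(humaneval_rates)
print(round(expected_length, 3))  # ~3.411, close to the reported 3.410
```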
References
Paper: DFlash: Block Diffusion for Flash Speculative Decoding