---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- diffusion
- efficiency
- flash-decoding
- qwen
- diffusion-language-model
---
# LLaMA3.1-8B-Instruct-DFlash-UltraChat
[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
**DFlash** is a novel speculative decoding method that uses a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
This model is the **drafter** component. It must be used in conjunction with the target model `meta-llama/Llama-3.1-8B-Instruct`.
<div align="center">
<img src="assets/dflash_system.png" alt="DFlash Architecture" width="100%">
</div>
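As background, speculative decoding alternates between drafting a block of candidate tokens and verifying them with the target model in a single forward pass. The toy sketch below illustrates this draft-and-verify loop with greedy acceptance and mock token-level "models"; it is our own simplified illustration, not the DFlash implementation.

```python
def speculative_step(draft_block, target_next_token, prefix, block_size):
    """One draft-and-verify round (toy version, greedy acceptance)."""
    # Drafter proposes `block_size` tokens in parallel.
    draft = draft_block(prefix, block_size)
    accepted = []
    for tok in draft:
        # Target model's greedy choice given prefix + accepted tokens so far.
        expected = target_next_token(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token matches: accept it
        else:
            accepted.append(expected)  # mismatch: take the target token, stop
            break
    else:
        # All draft tokens accepted: append one bonus token from the target.
        accepted.append(target_next_token(prefix + accepted))
    return accepted

# Toy "models": the target greedily emits a fixed sequence; the drafter
# happens to predict it perfectly here, so all 4 drafts plus a bonus token
# are accepted in a single verification step.
sequence = [1, 2, 3, 4, 5, 6, 7, 8]
target = lambda prefix: sequence[len(prefix)]
drafter = lambda prefix, k: sequence[len(prefix):len(prefix) + k]

out = speculative_step(drafter, target, [], 4)  # → [1, 2, 3, 4, 5]
```

The more draft tokens the target accepts per step, the fewer sequential target passes are needed, which is where the speedup comes from.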
## 📊 Training Data
**LLaMA3.1-8B-Instruct-DFlash-UltraChat** is trained on the **UltraChat-200K** and **ShareGPT** datasets, matching the EAGLE-3 training data. The assistant responses in these datasets are regenerated with `meta-llama/Llama-3.1-8B-Instruct`.
## 🚀 Quick Start
### SGLang
DFlash is supported in SGLang; vLLM integration is in progress.
#### Installation
```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```
#### Inference
```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
```
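Once the server is up, it can be queried through SGLang's OpenAI-compatible endpoint (port 30000 by default; adjust if you pass `--port`). The question below is just an illustrative example:

```shell
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}],
        "max_tokens": 256
    }'
```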
### Transformers
#### Installation
```bash
pip install transformers==4.57.3 torch==2.9.0 accelerate
```
#### Inference
```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the DFlash drafter (custom modeling code, hence trust_remote_code)
model = AutoModel.from_pretrained(
    "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()

# Load the target model that the drafter speculates for
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompt = "How many positive whole-number divisors does 196 have?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Speculative generation: the drafter proposes blocks, the target verifies
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    stop_token_ids=[tokenizer.eos_token_id],
)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
```
## Evaluation
DFlash consistently achieves higher speedups than **EAGLE-3**, the state-of-the-art speculative decoding method. All experiments are conducted with **SGLang** on a single **B200 GPU**. The numbered columns below report speedup over the vanilla decoding baseline at batch sizes 1 to 32; Avg. τ is the average number of tokens accepted per verification step.
For EAGLE-3, we evaluate two speculative decoding configurations:
- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 10`
- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 60`, which is the **official** setting used in the EAGLE-3 paper.
For DFlash, we use a block size of 10 during speculation.
We compare against the EAGLE-3 checkpoint [lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B), which is the **official** EAGLE-3 checkpoint adapted for SGLang inference.
Both the DFlash and EAGLE-3 draft models are trained on the **UltraChat-200K** and **ShareGPT** datasets.
#### GSM8K
| Method | 1 | 4 | 8 | 16 | 32 | Avg. τ |
|------------------|-------|-------|-------|-------|-------|--------|
| Baseline (TPS) | 249 | 923 | 1739 | 3245 | 5349 | — |
| EAGLE-3 (10) | 1.6× | 1.5× | 1.4× | 1.2× | 1.0× | 3.49 |
| EAGLE-3 (60) | 1.9× | 1.6× | 1.3× | 0.9× | 0.6× | 4.55 |
| **DFlash (10)** | **2.4×** | **2.2×** | **2.1×** | **1.8×** | **1.6×** | **4.32** |
---
#### HumanEval
| Method | 1 | 4 | 8 | 16 | 32 | Avg. τ |
|------------------|-------|-------|-------|-------|-------|--------|
| Baseline (TPS) | 245 | 922 | 1778 | 3336 | 5854 | — |
| EAGLE-3 (10) | 2.0× | 1.9× | 1.8× | 1.5× | 1.2× | 3.62 |
| EAGLE-3 (60) | 2.0× | 1.7× | 1.3× | 0.9× | 0.6× | 4.65 |
| **DFlash (10)** | **2.8×** | **2.6×** | **2.5×** | **2.1×** | **1.8×** | **4.91** |
---
#### Alpaca
| Method | 1 | 4 | 8 | 16 | 32 | Avg. τ |
|------------------|-------|-------|-------|-------|-------|--------|
| Baseline (TPS) | 245 | 906 | 1745 | 3237 | 5434 | — |
| EAGLE-3 (10) | 1.5× | 1.4× | 1.4× | 1.1× | 0.9× | 3.11 |
| EAGLE-3 (60) | 1.8× | 1.5× | 1.2× | 0.8× | 0.5× | 4.07 |
| **DFlash (10)** | **2.2×** | **2.0×** | **1.8×** | **1.5×** | **1.4×** | **3.73** |
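As a rough back-of-the-envelope model (our own simplification, not from the paper), the achievable speedup scales with the average acceptance length τ divided by the relative cost of one speculative step:

```python
def approx_speedup(tau, draft_overhead):
    """Idealized speculative-decoding speedup.

    tau: average tokens accepted per verification step.
    draft_overhead: cost of one drafting + verification step, expressed as
        a multiple of one plain target decoding step (>= 1.0).
    """
    # Without speculation: 1 step yields 1 token. With speculation: one step
    # of cost `draft_overhead` yields `tau` tokens on average.
    return tau / draft_overhead

# Illustrative numbers only: tau = 4.32 (GSM8K row above) with an assumed
# per-step overhead of 1.8x a plain decode step.
speedup = approx_speedup(4.32, 1.8)  # roughly 2.4x
```

This also suggests why speedup shrinks at larger batch sizes in the tables above: the relative cost of drafting and verification grows as the baseline becomes more compute-bound, even though τ stays the same.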
## **Acknowledgement**
We are grateful to [Yotta Labs](https://www.yottalabs.ai/) for their compute support in training this draft model.
## **Citation**
If you find DFlash useful for your research or applications, please cite our paper:
```bibtex
@misc{chen2026dflash,
title = {DFlash: Block Diffusion for Flash Speculative Decoding},
author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
year = {2026},
eprint = {2602.06036},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2602.06036}
}
```