---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- diffusion
- efficiency
- flash-decoding
- qwen
- diffusion-language-model
---

# LLaMA3.1-8B-Instruct-DFlash-UltraChat

[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)

**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This model is the **drafter** component. It must be used together with the target model `meta-llama/Llama-3.1-8B-Instruct`.
*Figure: DFlash architecture.*
## 📊 Training Data

**LLaMA3.1-8B-Instruct-DFlash-UltraChat** is trained on the **UltraChat-200K** and **ShareGPT** datasets, to align with the EAGLE-3 training data. The assistant responses in both datasets are regenerated by `meta-llama/Llama-3.1-8B-Instruct`.

## 🚀 Quick Start

### SGLang

DFlash is supported in SGLang; vLLM integration is in progress.

#### Installation

```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```

#### Inference

```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
```

### Transformers

#### Installation

```bash
pip install transformers==4.57.3 torch==2.9.0 accelerate
```

#### Inference

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the DFlash drafter (this repository) and the target model.
model = AutoModel.from_pretrained(
    "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    device_map="cuda:0",
).eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompt = "How many positive whole-number divisors does 196 have?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Speculative generation: the drafter proposes blocks, the target verifies.
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    stop_token_ids=[tokenizer.eos_token_id],
)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
```

## Evaluation

DFlash consistently achieves higher speedups than the state-of-the-art speculative decoding method **EAGLE-3**. All experiments are conducted with **SGLang** on a single **B200 GPU**.

For EAGLE-3, we evaluate two speculative decoding configurations:

- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 10`
- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 60`, the **official** setting used in the EAGLE-3 paper

For DFlash, we use a block size of 10 during speculation. We compare against the EAGLE-3 checkpoint [lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B), the **official** EAGLE-3 checkpoint adapted for SGLang inference. Both the DFlash and EAGLE-3 draft models are trained on the **UltraChat-200K** and **ShareGPT** datasets.

#### GSM8K

| Method           | 1        | 4        | 8        | 16       | 32       | Avg. τ   |
|------------------|----------|----------|----------|----------|----------|----------|
| Baseline (TPS)   | 249      | 923      | 1739     | 3245     | 5349     | —        |
| EAGLE-3 (10)     | 1.6×     | 1.5×     | 1.4×     | 1.2×     | 1.0×     | 3.49     |
| EAGLE-3 (60)     | 1.9×     | 1.6×     | 1.3×     | 0.9×     | 0.6×     | 4.55     |
| **DFlash (10)**  | **2.4×** | **2.2×** | **2.1×** | **1.8×** | **1.6×** | **4.32** |

---

#### HumanEval

| Method           | 1        | 4        | 8        | 16       | 32       | Avg. τ   |
|------------------|----------|----------|----------|----------|----------|----------|
| Baseline (TPS)   | 245      | 922      | 1778     | 3336     | 5854     | —        |
| EAGLE-3 (10)     | 2.0×     | 1.9×     | 1.8×     | 1.5×     | 1.2×     | 3.62     |
| EAGLE-3 (60)     | 2.0×     | 1.7×     | 1.3×     | 0.9×     | 0.6×     | 4.65     |
| **DFlash (10)**  | **2.8×** | **2.6×** | **2.5×** | **2.1×** | **1.8×** | **4.91** |

---

#### Alpaca

| Method           | 1        | 4        | 8        | 16       | 32       | Avg. τ   |
|------------------|----------|----------|----------|----------|----------|----------|
| Baseline (TPS)   | 245      | 906      | 1745     | 3237     | 5434     | —        |
| EAGLE-3 (10)     | 1.5×     | 1.4×     | 1.4×     | 1.1×     | 0.9×     | 3.11     |
| EAGLE-3 (60)     | 1.8×     | 1.5×     | 1.2×     | 0.8×     | 0.5×     | 4.07     |
| **DFlash (10)**  | **2.2×** | **2.0×** | **1.8×** | **1.5×** | **1.4×** | **3.73** |

## Acknowledgement

We are grateful to [Yotta Labs](https://www.yottalabs.ai/) for their compute support in training this draft model.

## Citation

If you find DFlash useful for your research or applications, please cite our project.

```bibtex
@misc{chen2026dflash,
  title         = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author        = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  year          = {2026},
  eprint        = {2602.06036},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2602.06036}
}
```
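A note on reading the evaluation tables: "Avg. τ" is commonly defined in the speculative decoding literature as the average acceptance length, i.e., how many tokens are produced per verification pass of the target model. As a toy illustration of how such a statistic is computed (the acceptance counts below are hypothetical, not taken from the experiments above):

```python
# Hypothetical per-step acceptance counts for one generation
# (each entry = tokens kept from one draft-and-verify step).
accepted = [5, 4, 6, 3, 5, 4]

avg_tau = sum(accepted) / len(accepted)  # average acceptance length (τ)
target_passes = len(accepted)            # one target forward pass per step
tokens_generated = sum(accepted)

print(f"Avg. τ = {avg_tau:.2f}: {tokens_generated} tokens "
      f"from {target_passes} target passes")
```

A higher τ means fewer target forward passes per generated token, which is why the speedup columns generally track the "Avg. τ" column above.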