---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- diffusion
- efficiency
- flash-decoding
- qwen
- diffusion-language-model
---
# LLaMA3.1-8B-Instruct-DFlash-UltraChat
[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
**DFlash** is a novel speculative decoding method that uses a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
This model is the **drafter** component. It must be used in conjunction with the target model `meta-llama/Llama-3.1-8B-Instruct`.
<div align="center">
<img src="assets/dflash_system.png" alt="DFlash Architecture" width="100%">
</div>
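As background, speculative decoding alternates between drafting a block of candidate tokens and verifying them with the target model in a single forward pass. The toy sketch below illustrates this draft-and-verify loop with greedy acceptance and mock token-level "models"; it is our own simplified illustration, not the DFlash implementation.

```python
def speculative_step(draft_block, target_next_token, prefix, block_size):
    """One draft-and-verify round (toy version, greedy acceptance)."""
    # Drafter proposes `block_size` tokens in parallel.
    draft = draft_block(prefix, block_size)
    accepted = []
    for tok in draft:
        # Target model's greedy choice given prefix + accepted tokens so far.
        expected = target_next_token(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token matches: accept it
        else:
            accepted.append(expected)  # mismatch: take the target token, stop
            break
    else:
        # All draft tokens accepted: append one bonus token from the target.
        accepted.append(target_next_token(prefix + accepted))
    return accepted

# Toy "models": the target greedily emits a fixed sequence; the drafter
# happens to predict it perfectly here, so all 4 drafts plus a bonus token
# are accepted in a single verification step.
sequence = [1, 2, 3, 4, 5, 6, 7, 8]
target = lambda prefix: sequence[len(prefix)]
drafter = lambda prefix, k: sequence[len(prefix):len(prefix) + k]

out = speculative_step(drafter, target, [], 4)  # → [1, 2, 3, 4, 5]
```

The more draft tokens the target accepts per step, the fewer sequential target passes are needed, which is where the speedup comes from.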
## 📊 Training Data
**LLaMA3.1-8B-Instruct-DFlash-UltraChat** is trained on the **UltraChat-200K** and **ShareGPT** datasets, matching the EAGLE-3 training data. The assistant responses in these datasets are regenerated with `meta-llama/Llama-3.1-8B-Instruct`.
## 🚀 Quick Start
### SGLang
DFlash is supported in SGLang; vLLM integration is in progress.
#### Installation
```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```
#### Inference
```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
```
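Once the server is up, it can be queried through SGLang's OpenAI-compatible endpoint (port 30000 by default; adjust if you pass `--port`). The question below is just an illustrative example:

```shell
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}],
        "max_tokens": 256
    }'
```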
### Transformers
#### Installation
```bash
pip install transformers==4.57.3 torch==2.9.0 accelerate
```
#### Inference
```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the DFlash drafter (custom modeling code, hence trust_remote_code)
model = AutoModel.from_pretrained(
    "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()

# Load the target model that the drafter speculates for
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompt = "How many positive whole-number divisors does 196 have?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Speculative generation: the drafter proposes blocks, the target verifies
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    stop_token_ids=[tokenizer.eos_token_id],
)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
```
## Evaluation
DFlash consistently achieves higher speedups than **EAGLE-3**, the state-of-the-art speculative decoding method. All experiments are conducted with **SGLang** on a single **B200 GPU**. The numbered columns below report speedup over the vanilla decoding baseline at batch sizes 1 to 32; Avg. τ is the average number of tokens accepted per verification step.
For EAGLE-3, we evaluate two speculative decoding configurations:
- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 10`
- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 60`, which is the **official** setting used in the EAGLE-3 paper.
For DFlash, we use a block size of 10 during speculation.
We compare against the EAGLE-3 checkpoint [lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B), which is the **official** EAGLE-3 checkpoint adapted for SGLang inference.
Both the DFlash and EAGLE-3 draft models are trained on the **UltraChat-200K** and **ShareGPT** datasets.
#### GSM8K
| Method | 1 | 4 | 8 | 16 | 32 | Avg. τ |
|------------------|-------|-------|-------|-------|-------|--------|
| Baseline (TPS) | 249 | 923 | 1739 | 3245 | 5349 | — |
| EAGLE-3 (10) | 1.6× | 1.5× | 1.4× | 1.2× | 1.0× | 3.49 |
| EAGLE-3 (60) | 1.9× | 1.6× | 1.3× | 0.9× | 0.6× | 4.55 |
| **DFlash (10)** | **2.4×** | **2.2×** | **2.1×** | **1.8×** | **1.6×** | **4.32** |
---
#### HumanEval
| Method | 1 | 4 | 8 | 16 | 32 | Avg. τ |
|------------------|-------|-------|-------|-------|-------|--------|
| Baseline (TPS) | 245 | 922 | 1778 | 3336 | 5854 | — |
| EAGLE-3 (10) | 2.0× | 1.9× | 1.8× | 1.5× | 1.2× | 3.62 |
| EAGLE-3 (60) | 2.0× | 1.7× | 1.3× | 0.9× | 0.6× | 4.65 |
| **DFlash (10)** | **2.8×** | **2.6×** | **2.5×** | **2.1×** | **1.8×** | **4.91** |
---
#### Alpaca
| Method | 1 | 4 | 8 | 16 | 32 | Avg. τ |
|------------------|-------|-------|-------|-------|-------|--------|
| Baseline (TPS) | 245 | 906 | 1745 | 3237 | 5434 | — |
| EAGLE-3 (10) | 1.5× | 1.4× | 1.4× | 1.1× | 0.9× | 3.11 |
| EAGLE-3 (60) | 1.8× | 1.5× | 1.2× | 0.8× | 0.5× | 4.07 |
| **DFlash (10)** | **2.2×** | **2.0×** | **1.8×** | **1.5×** | **1.4×** | **3.73** |
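As a rough back-of-the-envelope model (our own simplification, not from the paper), the achievable speedup scales with the average acceptance length τ divided by the relative cost of one speculative step:

```python
def approx_speedup(tau, draft_overhead):
    """Idealized speculative-decoding speedup.

    tau: average tokens accepted per verification step.
    draft_overhead: cost of one drafting + verification step, expressed as
        a multiple of one plain target decoding step (>= 1.0).
    """
    # Without speculation: 1 step yields 1 token. With speculation: one step
    # of cost `draft_overhead` yields `tau` tokens on average.
    return tau / draft_overhead

# Illustrative numbers only: tau = 4.32 (GSM8K row above) with an assumed
# per-step overhead of 1.8x a plain decode step.
speedup = approx_speedup(4.32, 1.8)  # roughly 2.4x
```

This also suggests why speedup shrinks at larger batch sizes in the tables above: the relative cost of drafting and verification grows as the baseline becomes more compute-bound, even though τ stays the same.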
## **Acknowledgement**
We are grateful to [Yotta Labs](https://www.yottalabs.ai/) for their compute support in training this draft model.
## **Citation**
If you find DFlash useful for your research or applications, please cite our paper:
```bibtex
@misc{chen2026dflash,
title = {DFlash: Block Diffusion for Flash Speculative Decoding},
author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
year = {2026},
eprint = {2602.06036},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2602.06036}
}
```