---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- diffusion
- efficiency
- flash-decoding
- qwen
- diffusion-language-model
---

# LLaMA3.1-8B-Instruct-DFlash-UltraChat

[**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)

**DFlash** is a speculative decoding method that uses a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This model is the **drafter** component. It must be used together with the target model `meta-llama/Llama-3.1-8B-Instruct`.
*Figure: DFlash architecture.*
## 📊 Training Data

**LLaMA3.1-8B-Instruct-DFlash-UltraChat** is trained on the **UltraChat-200K** and **ShareGPT** datasets, to align with the EAGLE-3 training data. The assistant responses in both datasets are regenerated by `meta-llama/Llama-3.1-8B-Instruct`.

## 🚀 Quick Start

### SGLang

DFlash is supported in SGLang; vLLM integration is in progress.

#### Installation

```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head#subdirectory=python"
```

#### Inference

```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
    --tp-size 1 \
    --dtype bfloat16 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code
```

### Transformers

#### Installation

```bash
pip install transformers==4.57.3 torch==2.9.0 accelerate
```

#### Inference

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the DFlash drafter (this repository) and the target model.
model = AutoModel.from_pretrained(
    "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0",
).eval()
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    device_map="cuda:0",
).eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompt = "How many positive whole-number divisors does 196 have?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Speculative generation: the drafter proposes blocks, the target verifies.
generate_ids = model.spec_generate(
    input_ids=model_inputs["input_ids"],
    max_new_tokens=2048,
    temperature=0.0,
    target=target,
    stop_token_ids=[tokenizer.eos_token_id],
)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
```

## Evaluation

DFlash consistently achieves higher speedups than the state-of-the-art speculative decoding method **EAGLE-3**. All experiments are conducted with **SGLang** on a single **B200 GPU**.

For EAGLE-3, we evaluate two speculative decoding configurations:

- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 10`
- `--speculative-num-steps 7`, `--speculative-eagle-topk 10`, `--speculative-num-draft-tokens 60`, the **official** setting used in the EAGLE-3 paper

For DFlash, we use a block size of 10 during speculation. We compare against the EAGLE-3 checkpoint [lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B), the **official** EAGLE-3 checkpoint adapted for SGLang inference. Both the DFlash and EAGLE-3 draft models are trained on the **UltraChat-200K** and **ShareGPT** datasets.

#### GSM8K

| Method           | 1        | 4        | 8        | 16       | 32       | Avg. τ   |
|------------------|----------|----------|----------|----------|----------|----------|
| Baseline (TPS)   | 249      | 923      | 1739     | 3245     | 5349     | —        |
| EAGLE-3 (10)     | 1.6×     | 1.5×     | 1.4×     | 1.2×     | 1.0×     | 3.49     |
| EAGLE-3 (60)     | 1.9×     | 1.6×     | 1.3×     | 0.9×     | 0.6×     | 4.55     |
| **DFlash (10)**  | **2.4×** | **2.2×** | **2.1×** | **1.8×** | **1.6×** | **4.32** |

---

#### HumanEval

| Method           | 1        | 4        | 8        | 16       | 32       | Avg. τ   |
|------------------|----------|----------|----------|----------|----------|----------|
| Baseline (TPS)   | 245      | 922      | 1778     | 3336     | 5854     | —        |
| EAGLE-3 (10)     | 2.0×     | 1.9×     | 1.8×     | 1.5×     | 1.2×     | 3.62     |
| EAGLE-3 (60)     | 2.0×     | 1.7×     | 1.3×     | 0.9×     | 0.6×     | 4.65     |
| **DFlash (10)**  | **2.8×** | **2.6×** | **2.5×** | **2.1×** | **1.8×** | **4.91** |

---

#### Alpaca

| Method           | 1        | 4        | 8        | 16       | 32       | Avg. τ   |
|------------------|----------|----------|----------|----------|----------|----------|
| Baseline (TPS)   | 245      | 906      | 1745     | 3237     | 5434     | —        |
| EAGLE-3 (10)     | 1.5×     | 1.4×     | 1.4×     | 1.1×     | 0.9×     | 3.11     |
| EAGLE-3 (60)     | 1.8×     | 1.5×     | 1.2×     | 0.8×     | 0.5×     | 4.07     |
| **DFlash (10)**  | **2.2×** | **2.0×** | **1.8×** | **1.5×** | **1.4×** | **3.73** |

## Acknowledgement

We are grateful to [Yotta Labs](https://www.yottalabs.ai/) for their compute support in training this draft model.

## Citation

If you find DFlash useful for your research or applications, please cite our project.

```bibtex
@misc{chen2026dflash,
  title         = {DFlash: Block Diffusion for Flash Speculative Decoding},
  author        = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  year          = {2026},
  eprint        = {2602.06036},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2602.06036}
}
```
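A note on reading the evaluation tables: "Avg. τ" is commonly defined in the speculative decoding literature as the average acceptance length, i.e., how many tokens are produced per verification pass of the target model. As a toy illustration of how such a statistic is computed (the acceptance counts below are hypothetical, not taken from the experiments above):

```python
# Hypothetical per-step acceptance counts for one generation
# (each entry = tokens kept from one draft-and-verify step).
accepted = [5, 4, 6, 3, 5, 4]

avg_tau = sum(accepted) / len(accepted)  # average acceptance length (τ)
target_passes = len(accepted)            # one target forward pass per step
tokens_generated = sum(accepted)

print(f"Avg. τ = {avg_tau:.2f}: {tokens_generated} tokens "
      f"from {target_passes} target passes")
```

A higher τ means fewer target forward passes per generated token, which is why the speedup columns generally track the "Avg. τ" column above.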