	---
	license: other
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- dflash
	- speculative-decoding
	- block-diffusion
	- draft-model
	- efficiency
	- minimax
	- minimax_m2
	- diffusion-language-model
	---

	# MiniMax-M2.7-DFlash

	[Paper](https://arxiv.org/abs/2602.06036) | [GitHub](https://github.com/z-lab/dflash) | [Blog](https://z-lab.ai/projects/dflash/)

	DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel. This is the drafter model, which must be paired with [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).
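
To make the draft-then-verify idea concrete, here is a minimal toy sketch of speculative decoding with greedy verification. It is not the DFlash implementation: the function names, the integer "tokens", and the two toy models are all hypothetical, and the drafter is a plain loop standing in for a block diffusion model that would propose the whole block in parallel.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def draft_block(draft_next: NextToken, context: List[Token], block_size: int) -> List[Token]:
    """Propose block_size tokens (toy stand-in: a sequential loop, not a diffusion model)."""
    proposed: List[Token] = []
    for _ in range(block_size):
        proposed.append(draft_next(context + proposed))
    return proposed


def verify_block(target_next: NextToken, context: List[Token], proposed: List[Token]) -> List[Token]:
    """Keep the longest prefix the target agrees with, then add one target-chosen token."""
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends this step
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # extra token when every draft matches
    return accepted


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int = 8,
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of draft/verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_block(draft_next, out, block_size)
        out.extend(verify_block(target_next, out, proposed))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


# Hypothetical toy models: the target counts upward modulo 10,
# and the drafter agrees except at every fourth position.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if len(ctx) % 4 == 0 else (ctx[-1] + 1) % 10


tokens, steps = speculative_generate(toy_target, toy_draft, [3], max_new_tokens=12)
print(tokens, steps)
```

Under greedy verification the output matches what the target alone would have produced; the drafter only changes how many verification steps are needed. In a real system each verification is a single batched forward pass of the target over the whole block, which this toy loop does not model.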

	<div align="center">
	<img src="https://huggingface.co/z-lab/gemma-4-31B-it-DFlash/resolve/main/assets/dflash_system.png" alt="DFlash Architecture" width="85%">
	</div>

	## Quick Start

	### Installation

	vLLM:

	Check out [vLLM issue #46105](https://github.com/vllm-project/vllm/issues/46105).

	SGLang:

	```bash
	uv pip install "git+https://github.com/sgl-project/sglang.git#subdirectory=python"
	```

	### Launch Server

	vLLM:

	Check out [vLLM issue #46105](https://github.com/vllm-project/vllm/issues/46105).

	SGLang:

	```bash
	python -m sglang.launch_server \
	    --model-path MiniMaxAI/MiniMax-M2.7 \
	    --tp-size 4 \
	    --speculative-algorithm DFLASH \
	    --speculative-draft-model-path z-lab/MiniMax-M2.7-DFlash \
	    --attention-backend trtllm_mha \
	    --speculative-draft-attention-backend fa4 \
	    --mem-fraction-static 0.8 \
	    --trust-remote-code \
	    --host 0.0.0.0 \
	    --port 30000
	```
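
Loading the target and draft weights can take a while, so it may help to wait for the server before sending requests. The sketch below polls an HTTP endpoint until it answers with status 200. The helper names are hypothetical, and the URL in the usage note assumes the server exposes the OpenAI-compatible `/v1/models` route on the port used above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 60,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe(url) up to `attempts` times, sleeping delay_s between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

For example, `wait_until_ready("http://localhost:30000/v1/models")` would block until the server responds or the attempts run out. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.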

	### Usage

	For SGLang, use port `30000`.

	```python
	from openai import OpenAI

	client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

	response = client.chat.completions.create(
	    model="MiniMaxAI/MiniMax-M2.7",
	    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
	    max_tokens=4096,
	    temperature=0.0,
	    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
	)
	print(response.choices[0].message.content)
	```

	## Benchmark Results

	Setup: 4 NVIDIA B200 GPUs per server/run, SGLang, tensor parallel size 4, target attention backend `trtllm_mha`, draft attention backend `fa4`, thinking enabled, max output length 4096, greedy decoding. Concurrency 1 uses 128 prompts; concurrency 32 uses 1024 prompts.

	### Throughput

	_Generated tokens/sec_

	Block Size = 8

	| Task | Concurrency | DFlash |
	|---|---:|---:|
	| Math500 | 1 | 331.12 |
	| | 32 | 4422.52 |
	| GSM8K | 1 | 304.07 |
	| | 32 | 4202.09 |
	| HumanEval | 1 | 333.44 |
	| | 32 | 4394.23 |
	| MT-Bench | 1 | 350.84 |
	| | 32 | 4549.75 |
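
As a reading aid, the short script below derives two quantities from the table: how much throughput grows from concurrency 1 to concurrency 32, and the average rate each request would see at concurrency 32. It assumes the reported figures are aggregate throughput across all concurrent requests, which the table does not state explicitly; the function and field names are illustrative.

```python
from typing import Dict

# Generated tokens/sec copied from the throughput table, keyed by concurrency.
THROUGHPUT: Dict[str, Dict[int, float]] = {
    "Math500": {1: 331.12, 32: 4422.52},
    "GSM8K": {1: 304.07, 32: 4202.09},
    "HumanEval": {1: 333.44, 32: 4394.23},
    "MT-Bench": {1: 350.84, 32: 4549.75},
}


def scaling_summary(throughput: Dict[str, Dict[int, float]], high: int = 32, low: int = 1):
    """Per task: throughput ratio high/low and the per-request rate at `high`."""
    summary = {}
    for task, by_concurrency in throughput.items():
        summary[task] = {
            "scaling": by_concurrency[high] / by_concurrency[low],
            # Assumption: the table reports aggregate tokens/sec over all requests.
            "per_request": by_concurrency[high] / high,
        }
    return summary


for task, row in scaling_summary(THROUGHPUT).items():
    print(f"{task:10s} scaling x{row['scaling']:.2f}  per-request {row['per_request']:.1f} tok/s")
```

On these numbers the aggregate throughput grows by roughly 13x to 14x when concurrency rises 32x, so under the stated assumption each individual request runs slower at concurrency 32 than at concurrency 1.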

	### Acceptance Length

	| Task | Concurrency 1 | Concurrency 32 |
	|---|---:|---:|
	| Math500 | 3.561 | 3.658 |
	| GSM8K | 3.481 | 3.586 |
	| HumanEval | 3.610 | 3.657 |
	| MT-Bench | 3.550 | 3.624 |
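
One way to read these numbers is as a rough count of target verification steps per response. The sketch below assumes acceptance length means the average number of tokens emitted per verification step, which is the usual convention but is not spelled out here, and it uses the 4096-token maximum output length from the benchmark setup as an example budget. The function name is illustrative.

```python
import math
from typing import Dict

# Acceptance lengths at concurrency 1, copied from the table above.
ACCEPTANCE_C1: Dict[str, float] = {
    "Math500": 3.561,
    "GSM8K": 3.481,
    "HumanEval": 3.610,
    "MT-Bench": 3.550,
}


def verification_steps(num_tokens: int, acceptance_length: float) -> int:
    """Estimate target verification steps needed to emit num_tokens.

    Assumption: each step emits acceptance_length tokens on average.
    """
    if acceptance_length <= 0:
        raise ValueError("acceptance_length must be positive")
    return math.ceil(num_tokens / acceptance_length)


MAX_OUTPUT_TOKENS = 4096  # from the benchmark setup

for task, length in ACCEPTANCE_C1.items():
    steps = verification_steps(MAX_OUTPUT_TOKENS, length)
    print(f"{task:10s} ~{steps} verification steps vs {MAX_OUTPUT_TOKENS} one-token decode steps")
```

This is only step-count arithmetic. It ignores the cost of running the drafter and the extra work of verifying a block, so it should not be read as a wall-clock speedup.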

	## Acknowledgements

	Special thanks to [David Wang](https://davidwa.ng/) for his outstanding engineering support on this project. We are also grateful to [Modal](https://modal.com/), [InnoMatrix](https://innomatrix.ai), and [Yotta Labs](https://www.yottalabs.ai/) for providing the compute resources used to train this draft model.

	## Citation

	If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: [DFlash Feedback](https://forms.gle/4YNwfqb4nJdqn6hq9).

	```bibtex
	@article{chen2026dflash,
	  title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
	  author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
	  journal = {arXiv preprint arXiv:2602.06036},
	  year = {2026}
	}
	```