Update README.md

jianchen0311 committed on Apr 18

Commit e2db14d (verified) · 1 Parent(s): 42cd36d

Files changed (1)

README.md +6 -16

@@ -15,8 +15,6 @@ tags:
 # Kimi-K2.5-DFlash
 [**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
-**This model is still under training.**
 **DFlash** is a novel speculative decoding method that utilizes a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
 This model is the **drafter** component. It must be used in conjunction with the target model `moonshotai/Kimi-K2.5`.
@@ -29,6 +27,11 @@ This model is the **drafter** component. It must be used in conjunction with the
 ### Installation
+SGLang:
+```bash
+uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
+```
 vLLM:
 ```bash
 uv pip install vllm
@@ -37,21 +40,8 @@ uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vll
 Please refer to [PR39930](https://github.com/vllm-project/vllm/pull/39930) to see how to use DFlash with Kimi-K2.5 on vLLM.
-SGLang:
-```bash
-uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
-```
 ### Launch Server
-vLLM:
-```bash
-vllm serve moonshotai/Kimi-K2.5 \
-  --speculative-config '{"method": "dflash", "model": "z-lab/Kimi-K2.5-DFlash", "num_speculative_tokens": 7}' \
-  --attention-backend flashinfer \
-  --max-num-batched-tokens 32768
-```
 SGLang:
 ```bash
 # Optional: enable schedule overlapping (experimental, may not be stable)
@@ -89,7 +79,7 @@ print(response.choices[0].message.content)
 - Thinking: enabled
 - Max new tokens: 4096
 - Block size: 8
-- SGLang results. vLLM results might be different.
+- SGLang results.
 | Dataset   | Accept Length |
 |-----------|---------------|
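
The README text in this diff describes a drafter that proposes a block of tokens for a target model to verify, and it reports an "Accept Length" per dataset. As a rough illustration of what such a metric can measure, here is a minimal sketch of greedy speculative decoding with toy models. Everything in it is an assumption made for illustration: `toy_target_next`, `toy_draft_next`, `draft_block`, `target_choices`, and `speculative_decode` are hypothetical names, the drafter is a simple autoregressive stand-in rather than a block diffusion model, verification is exact-match against the target's greedy choice, and the accept length counts the one token the target contributes at each step. The README does not state how its own numbers are counted.

```python
from typing import List, Sequence, Tuple


def toy_target_next(context: Sequence[int]) -> int:
    """Hypothetical target model: always continues the cycle 0, 1, ..., 9."""
    return (context[-1] + 1) % 10


def toy_draft_next(context: Sequence[int]) -> int:
    """Hypothetical drafter: agrees with the target except on multiples of 3."""
    t = toy_target_next(context)
    return t if t % 3 else (t + 5) % 10


def draft_block(context: Sequence[int], block_size: int) -> List[int]:
    """Propose block_size tokens (autoregressive stand-in for a parallel drafter)."""
    ctx, block = list(context), []
    for _ in range(block_size):
        tok = toy_draft_next(ctx)
        block.append(tok)
        ctx.append(tok)
    return block


def target_choices(context: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Target's greedy token after each draft prefix (len(draft) + 1 entries).

    A real target model would produce these in a single forward pass.
    """
    return [toy_target_next(list(context) + list(draft[:i]))
            for i in range(len(draft) + 1)]


def speculative_decode(prompt: Sequence[int], block_size: int,
                       max_new_tokens: int) -> Tuple[List[int], List[int]]:
    """Return (generated tokens, tokens emitted per verification step)."""
    out: List[int] = []
    accept_lengths: List[int] = []
    while len(out) < max_new_tokens:
        context = list(prompt) + out
        draft = draft_block(context, block_size)
        choices = target_choices(context, draft)
        # Accept the longest draft prefix that matches the target's choices.
        n = 0
        while n < len(draft) and draft[n] == choices[n]:
            n += 1
        # Emit the accepted prefix plus the target's own token at position n.
        step = draft[:n] + [choices[n]]
        out.extend(step)
        accept_lengths.append(len(step))
    return out[:max_new_tokens], accept_lengths


tokens, accept_lengths = speculative_decode(prompt=[0], block_size=4,
                                            max_new_tokens=12)
mean_accept = sum(accept_lengths) / len(accept_lengths)
print(tokens)
print(accept_lengths, mean_accept)
```

Under these assumptions the generated tokens are identical to what the toy target would produce on its own, and the mean of `accept_lengths` is the average number of tokens gained per target verification step. A higher value means fewer target passes for the same output.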