jianchen0311 committed on
Commit e2db14d · verified · 1 Parent(s): 42cd36d

Update README.md

Files changed (1)
  1. README.md +6 -16
README.md CHANGED
@@ -15,8 +15,6 @@ tags:
  # Kimi-K2.5-DFlash
  [**Paper**](https://arxiv.org/abs/2602.06036) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)
 
- **This model is still under training.**
-
  **DFlash** is a novel speculative decoding method that utilizes a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
 
  This model is the **drafter** component. It must be used in conjunction with the target model `moonshotai/Kimi-K2.5`.
@@ -29,6 +27,11 @@ This model is the **drafter** component. It must be used in conjunction with the
 
  ### Installation
 
+ SGLang:
+ ```bash
+ uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
+ ```
+
  vLLM:
  ```bash
  uv pip install vllm
@@ -37,21 +40,8 @@ uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vll
 
  Please refer to [PR39930](https://github.com/vllm-project/vllm/pull/39930) to see how to use DFlash with Kimi-K2.5 on vLLM.
 
- SGLang:
- ```bash
- uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
- ```
-
  ### Launch Server
 
- vLLM:
- ```bash
- vllm serve moonshotai/Kimi-K2.5 \
- --speculative-config '{"method": "dflash", "model": "z-lab/Kimi-K2.5-DFlash", "num_speculative_tokens": 7}' \
- --attention-backend flashinfer \
- --max-num-batched-tokens 32768
- ```
-
  SGLang:
  ```bash
  # Optional: enable schedule overlapping (experimental, may not be stable)
@@ -89,7 +79,7 @@ print(response.choices[0].message.content)
  - Thinking: enabled
  - Max new tokens: 4096
  - Block size: 8
- - SGLang results. vLLM results might be different.
+ - SGLang results.
 
  | Dataset | Accept Length |
  |-----------|---------------|