lightseekorg/kimi-k2.5-eagle3

speculative-decoding

Update README.md

#2

by rogerwyf - opened Mar 16

base: refs/heads/main ← from: refs/pr/2

Files changed (1)

README.md +18 -3

@@ -74,13 +74,22 @@ Benchmarks were run using [SpecForge](https://github.com/sgl-project/SpecForge/b
 ### Requirements
 - NVIDIA GPU with CUDA 12.0+
 - [SGLang](https://github.com/sgl-project/sglang) ≥ 0.5.8
-### Launch Server
 ```bash
 python -m sglang.launch_server \
-    --model-path /path/to/Kimi-K2.5 \
     --tp 8 \
     --trust-remote-code \
     --speculative-algorithm EAGLE3 \
@@ -96,7 +105,7 @@ python -m sglang.launch_server \
 ```bash
 python bench_eagle3.py \
-    --model-path /path/to/Kimi-K2.5 \
     --port 30000 \
     --config-list 1,3,1,4 \
     --benchmark-list <benchmark_name> \
@@ -104,3 +113,9 @@ python bench_eagle3.py \
 ```
 `--config-list` format: `topk,num_steps,topk,num_draft_tokens`.

 ### Requirements
 - NVIDIA GPU with CUDA 12.0+
+- [vLLM](https://github.com/vllm-project/vllm) ≥ 0.18.0, or install the [nightly wheel/docker image](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#install-the-latest-code).
 - [SGLang](https://github.com/sgl-project/sglang) ≥ 0.5.8
+### Launch Server (vLLM)
+```bash
+vllm serve moonshotai/Kimi-K2.5 \
+    --tensor-parallel-size 8 \
+    --speculative-config '{"model": "lightseekorg/kimi-k2.5-eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
+    --trust-remote-code
+```
+### Launch Server (SGLang)
 ```bash
 python -m sglang.launch_server \
+    --model-path moonshotai/Kimi-K2.5 \
     --tp 8 \
     --trust-remote-code \
     --speculative-algorithm EAGLE3 \
 ```bash
 python bench_eagle3.py \
+    --model-path moonshotai/Kimi-K2.5 \
     --port 30000 \
     --config-list 1,3,1,4 \
     --benchmark-list <benchmark_name> \
 ```
 `--config-list` format: `topk,num_steps,topk,num_draft_tokens`.
+### Metrics
+The two engines report different acceptance-length numbers for the same underlying run because they count tokens differently:
+- SGLang (`accept len`) adds 1 to each round's count of accepted draft tokens, to include the guaranteed target-model token, and then averages over rounds.
+- vLLM (`mean acceptance length`) does not add 1; it counts only accepted draft tokens.
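
Reviewer note: the difference described under Metrics can be illustrated with a minimal sketch. It assumes both engines see the same per-round counts of accepted draft tokens and that each metric is computed exactly as the two bullets state. The function names and the sample counts are hypothetical and are not taken from either engine's code.

```python
from statistics import mean


def sglang_accept_len(accepted_per_round):
    """Per-round average that counts the guaranteed target-model token (+1).

    Hypothetical helper mirroring the SGLang bullet, not SGLang's own code.
    """
    return mean(a + 1 for a in accepted_per_round)


def vllm_mean_acceptance_length(accepted_per_round):
    """Per-round average of accepted draft tokens only, without the +1.

    Hypothetical helper mirroring the vLLM bullet, not vLLM's own code.
    """
    return mean(accepted_per_round)


# Made-up accepted-draft-token counts for five verification rounds.
rounds = [3, 1, 2, 0, 3]

print("SGLang-style accept len:", sglang_accept_len(rounds))
print("vLLM-style mean acceptance length:", vllm_mean_acceptance_length(rounds))
```

Under these assumptions the two figures differ by exactly 1 for the same run, so subtracting 1 from the SGLang-style number (or adding 1 to the vLLM-style number) puts them on the same scale before comparing.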