Files changed (1)
  1. README.md +18 -3
README.md CHANGED
@@ -74,13 +74,22 @@ Benchmarks were run using [SpecForge](https://github.com/sgl-project/SpecForge/b
74
  ### Requirements
75
 
76
  - NVIDIA GPU with CUDA 12.0+
 
77
  - [SGLang](https://github.com/sgl-project/sglang) ≥ 0.5.8
78
 
79
- ### Launch Server
80
 
81
  ```bash
82
  python -m sglang.launch_server \
83
- --model-path /path/to/Kimi-K2.5 \
84
  --tp 8 \
85
  --trust-remote-code \
86
  --speculative-algorithm EAGLE3 \
@@ -96,7 +105,7 @@ python -m sglang.launch_server \
96
 
97
  ```bash
98
  python bench_eagle3.py \
99
- --model-path /path/to/Kimi-K2.5 \
100
  --port 30000 \
101
  --config-list 1,3,1,4 \
102
  --benchmark-list <benchmark_name> \
@@ -104,3 +113,9 @@ python bench_eagle3.py \
104
  ```
105
 
106
  `--config-list` format: `topk,num_steps,topk,num_draft_tokens`.
 
 
 
 
 
 
 
74
  ### Requirements
75
 
76
  - NVIDIA GPU with CUDA 12.0+
77
+ - [vLLM](https://github.com/vllm-project/vllm) ≥ 0.18.0, or the [nightly wheel/Docker image](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#install-the-latest-code)
78
  - [SGLang](https://github.com/sgl-project/sglang) ≥ 0.5.8
79
 
80
+ ### Launch Server (vLLM)
81
+ ```bash
82
+ vllm serve moonshotai/Kimi-K2.5 \
83
+ --tensor-parallel-size 8 \
84
+ --speculative-config '{"model": "lightseekorg/kimi-k2.5-eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
85
+ --trust-remote-code
86
+ ```
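The inline JSON passed to `--speculative-config` is easy to misquote when the command is edited by hand. As a minimal sketch, the same command line can be generated from Python so that the JSON and the shell quoting are produced mechanically. The model names and flags are taken from the command above; the variable names `spec_config`, `argv`, and `command` are illustrative only, and nothing here is part of the PR itself.

```python
import json
import shlex

# Values copied from the vLLM launch command above.
spec_config = {
    "model": "lightseekorg/kimi-k2.5-eagle3",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

# Build the argument list; json.dumps guarantees valid JSON.
argv = [
    "vllm", "serve", "moonshotai/Kimi-K2.5",
    "--tensor-parallel-size", "8",
    "--speculative-config", json.dumps(spec_config),
    "--trust-remote-code",
]

# shlex.join quotes each argument so the result can be pasted into a shell.
command = shlex.join(argv)
print(command)
```

Passing `argv` directly to `subprocess.run` would avoid shell quoting altogether; the printed string is only a convenience for copy and paste.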
87
+
88
+ ### Launch Server (SGLang)
89
 
90
  ```bash
91
  python -m sglang.launch_server \
92
+ --model-path moonshotai/Kimi-K2.5 \
93
  --tp 8 \
94
  --trust-remote-code \
95
  --speculative-algorithm EAGLE3 \
 
105
 
106
  ```bash
107
  python bench_eagle3.py \
108
+ --model-path moonshotai/Kimi-K2.5 \
109
  --port 30000 \
110
  --config-list 1,3,1,4 \
111
  --benchmark-list <benchmark_name> \
 
113
  ```
114
 
115
  `--config-list` format: `topk,num_steps,topk,num_draft_tokens`.
116
+
117
+ ### Metrics
118
+ The two engines report different acceptance-length numbers for the same underlying run because they count tokens differently:
119
+
120
+ - SGLang (`accept len`) adds 1 to each round's count of accepted draft tokens, to include the guaranteed target-model token, then averages over rounds.
121
+ - vLLM (`mean acceptance length`) does not add 1; it counts only the accepted draft tokens.
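Taking the two definitions above at face value, the reported numbers differ by exactly 1 for the same run. The sketch below illustrates this with made-up per-round counts. The function names and the `rounds` data are hypothetical, and this is not code from either engine.

```python
from statistics import mean

def accept_len_with_bonus(accepted_per_round):
    """Average per round, adding 1 for the guaranteed target-model token."""
    return mean([a + 1 for a in accepted_per_round])

def accept_len_draft_only(accepted_per_round):
    """Average per round, counting only accepted draft tokens."""
    return mean(accepted_per_round)

# Hypothetical number of accepted draft tokens in each speculation round.
rounds = [3, 1, 0, 2]

with_bonus = accept_len_with_bonus(rounds)   # convention described for SGLang
draft_only = accept_len_draft_only(rounds)   # convention described for vLLM
print(with_bonus, draft_only)  # 2.5 1.5
```

Under these assumptions, subtracting 1 from the with-bonus figure (or adding 1 to the draft-only figure) puts both numbers on the same scale before comparing them.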