File size: 25,580 Bytes
a227c91 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 | # Speculative Decoding
SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.
## Summary
### Jump to sections
- [EAGLE Decoding](#eagle-decoding)
- [EAGLE-2 Decoding](#eagle-2-decoding)
- [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile)
- [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling)
- [EAGLE-3 Decoding](#eagle-3-decoding)
- [Multi Token Prediction](#multi-token-prediction)
- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model)
- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler)
- [Ngram Speculative Decoding](#ngram-speculative-decoding)
- [Full Parameter Reference](#full-parameter-reference)
- [OOM Troubleshooting](#oom-troubleshooting)
- [References](#references)
### Quick guidance
- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`.
- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`.
- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`.
- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section).
- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`).
- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only).
- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`).
### Method comparison (mini table)
| Method | Draft source | Separate draft model? | How to enable | Notes / constraints |
|---|---|---:|---|---|
| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` |
| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Benefit varies by hardware/model; benchmark to verify |
| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab |
| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the benchmark below |
| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models |
| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does **not** support `--enable-dp-attention` |
| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` |
| NGRAM | Ngram cache from previous tokens | No | `--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill |
### Performance Highlights
Please see below for the huge improvements on throughput for LLaMA-Instruct 3.1 8B tested on MT bench that can be achieved via EAGLE3 decoding.
For further details please see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).
| Method | Throughput (tokens/s) |
|--------|----------------|
| SGLang (w/o speculative, 1x H100) | 158.34 tokens/s |
| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |
| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |
---
## EAGLE Decoding
To enable EAGLE speculative decoding the following parameters are relevant:
| Parameter | Description | Default |
|---|---|---|
| `--speculative-draft-model-path` | Draft model path/weights. **Typically required** for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted. | `None` |
| `--speculative-num-steps` | Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. | Auto (`5` for Llama/Grok; `3` for many other models) |
| `--speculative-eagle-topk` | Branching factor per step. Improves candidate diversity and acceptance rate, but increases memory/compute consumption. | Auto (`4` for Llama/Grok; `1` for many other models) |
| `--speculative-num-draft-tokens` | Maximum parallel verification capacity. Allows deeper tree evaluation but increases GPU memory usage. | Auto (`8` for Llama/Grok; `4` for many other models). If `topk=1`, it is adjusted to `num_steps + 1`. |
| `--speculative-accept-threshold-single` | Acceptance threshold for single-token verification. Lower values accept more aggressively. | `1.0` |
| `--speculative-accept-threshold-acc` | Accumulated acceptance threshold across steps. | `1.0` |
| `--speculative-attention-mode` | Attention mode for speculative operations (`prefill` or `decode`), affecting both target verification and draft extension. | `"prefill"` |
| `--speculative-draft-attention-backend` | Override attention backend for the draft model. | `None` (same as target) |
| `--speculative-draft-model-quantization` | Quantization method for the draft model. Use `"unquant"` to force no quantization even when the target model is quantized. | Same as target model |
| `--speculative-draft-model-revision` | Specific revision/commit of the draft model to load. | `None` (auto-set to `"main"` when `--speculative-draft-model-path` is set and revision is omitted) |
| `--speculative-draft-load-format` | Load format for the draft model weights. | `None` |
These parameters are mostly the same for EAGLE-2 and EAGLE-3. `--speculative-token-map` is ignored for EAGLE-3 models.
For `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens`: leave all three unset to use auto-tuning, or set all three explicitly when tuning.
You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).
### EAGLE-2 Decoding
You can enable EAGLE-2 Decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model.
**Launch the server:**
```bash
python3 -m sglang.launch_server \
--model meta-llama/Llama-2-7b-chat-hf \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
### EAGLE-2 Decoding with `torch.compile`
You can optionally enable `torch.compile` to apply kernel-level optimizations (operator fusion, autotune) to the draft model. The actual speedup depends on your hardware, model architecture, and batch size. In some configurations (e.g., small draft models on H100 where cuBLAS is already optimal and CUDA graphs are enabled), the benefit may be negligible. We recommend benchmarking with and without this flag on your specific setup to verify whether it helps.
To enable it, add `--enable-torch-compile` and optionally set `--torch-compile-max-bs`:
```bash
python3 -m sglang.launch_server \
--model meta-llama/Llama-2-7b-chat-hf \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.7 \
--enable-torch-compile \
--torch-compile-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling
By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead while accelerating the pipeline without quality degradation. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856).
In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens in FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec). Or you can obtain high-frequency tokens by directly downloading these tokens from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).
Thanks for the contribution from [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx).
```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--dtype float16 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
### EAGLE-3 Decoding
You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model.
```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--dtype float16 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Multi Token Prediction
We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang by using speculative decoding. We use `XiaomiMiMo/MiMo-7B-RL` as an example here (for DeepSeek MTP usage, refer to [deepseek_v32 doc](../basic_usage/deepseek_v32.md#multi-token-prediction)).
```bash
python3 -m sglang.launch_server \
--model XiaomiMiMo/MiMo-7B-RL \
--host 0.0.0.0 \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import requests
url = "http://localhost:30000/v1/chat/completions"
data = {
"model": "XiaomiMiMo/MiMo-7B-RL",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print(response.json())
```
---
## Standalone Speculative Decoding (Small Draft Model)
Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`.
Relevant parameters:
| Parameter | Description | Default |
|---|---|---|
| `--speculative-draft-model-path` | Draft model weights (smaller than the target model). | `None` |
| `--speculative-num-steps` | Draft depth (how many steps the draft model runs autoregressively). | `3` (auto default for STANDALONE) |
| `--speculative-eagle-topk` | Branching factor (token candidates per step). | `1` (auto default for STANDALONE) |
| `--speculative-num-draft-tokens` | Verification capacity. | `4` (auto default for STANDALONE) |
| `--speculative-draft-model-quantization` | Quantization for the draft model. Use `"unquant"` to disable quantization on the draft even when the target is quantized. | Same as target |
> **Note:** Standalone speculative decoding currently **does not support** `--enable-dp-attention`.
```bash
python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-7B-Instruct \
--speculative-algorithm STANDALONE \
--speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
--speculative-num-steps 4 \
--speculative-eagle-topk 2 \
--speculative-num-draft-tokens 7 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Speculative Decoding V2 (Overlap Scheduler)
SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`).
To enable it, set the environment variable:
- `SGLANG_ENABLE_SPEC_V2=True`
Notes:
- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**.
- If you explicitly set `--speculative-eagle-topk > 1`, the server will error.
- If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama). This is incompatible with SpecV2 and may not always trigger an immediate config error, so set `--speculative-eagle-topk 1` explicitly.
- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`.
```bash
SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-7B-Instruct \
--speculative-algorithm STANDALONE \
--speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
--speculative-num-steps 4 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 5 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Ngram Speculative Decoding
SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model.
Enable it with:
- `--speculative-algorithm NGRAM`
### Ngram-specific parameters
| Parameter | Description | Default |
|---|---|---|
| `--speculative-num-draft-tokens` | Number of draft tokens verified per step. If omitted, defaults to `--speculative-ngram-max-match-window-size`. | `12` (with default ngram settings) |
| `--speculative-ngram-min-match-window-size` | Minimum matching window size. | `1` |
| `--speculative-ngram-max-match-window-size` | Maximum matching window size. | `12` |
| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth. | `1` |
| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth. | `10` |
| `--speculative-ngram-match-type` | Match type: `"BFS"` or `"PROB"`. | `"BFS"` |
| `--speculative-ngram-branch-length` | How many recent tokens to insert into the cache. | `18` |
| `--speculative-ngram-capacity` | Cache capacity (number of entries). | `10,000,000` |
Notes:
- Ngram speculative decoding **only supports CUDA**.
- It currently **does not support** `--enable-dp-attention`.
- It disables the overlap scheduler and mixed chunked prefill.
- If `--speculative-ngram-max-bfs-breadth > 1` (thus `speculative_eagle_topk > 1`) and `page_size > 1`, use `--attention-backend flashinfer`; otherwise the server will error.
- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification.
```bash
python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-7B-Instruct \
--speculative-algorithm NGRAM \
--speculative-num-draft-tokens 16 \
--speculative-ngram-max-match-window-size 12 \
--speculative-ngram-max-bfs-breadth 10 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Full Parameter Reference
Below is a comprehensive list of all speculative decoding parameters available in SGLang:
### Core parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-algorithm` | `str` | `None` | Algorithm to use: `EAGLE`, `EAGLE3`, `STANDALONE`, `NGRAM`, `NEXTN` (alias of `EAGLE`) |
| `--speculative-draft-model-path` | `str` | `None` | Path to the draft model weights |
| `--speculative-draft-model-revision` | `str` | `None` | Specific revision/commit of the draft model (`"main"` is auto-used when draft path is set and revision is omitted) |
| `--speculative-draft-load-format` | `str` | `None` | Load format for draft model weights |
| `--speculative-num-steps` | `int` | `None` (auto-chosen when omitted) | Autoregressive drafting depth |
| `--speculative-eagle-topk` | `int` | `None` (auto-chosen when omitted) | Branching factor per drafting step |
| `--speculative-num-draft-tokens` | `int` | `None` (auto-chosen when omitted) | Maximum number of draft tokens for verification |
| `--speculative-accept-threshold-single` | `float` | `1.0` | Single-token acceptance threshold |
| `--speculative-accept-threshold-acc` | `float` | `1.0` | Accumulated acceptance threshold |
| `--speculative-token-map` | `str` | `None` | Path to FR-Spec high-frequency token map |
| `--speculative-attention-mode` | `str` | `"prefill"` | Attention mode for speculative operations (`"prefill"` or `"decode"`) |
| `--speculative-draft-attention-backend` | `str` | `None` | Override attention backend for the draft model |
| `--speculative-moe-runner-backend` | `str` | `None` | MoE runner backend for the draft model |
| `--speculative-moe-a2a-backend` | `str` | `None` | MoE all-to-all backend for the draft model |
| `--speculative-draft-model-quantization` | `str` | Same as target | Quantization for the draft model (`"unquant"` to disable) |
### Ngram-specific parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-ngram-min-match-window-size` | `int` | `1` | Minimum ngram matching window |
| `--speculative-ngram-max-match-window-size` | `int` | `12` | Maximum ngram matching window |
| `--speculative-ngram-min-bfs-breadth` | `int` | `1` | Minimum BFS breadth |
| `--speculative-ngram-max-bfs-breadth` | `int` | `10` | Maximum BFS breadth |
| `--speculative-ngram-match-type` | `str` | `"BFS"` | Match type: `"BFS"` or `"PROB"` |
| `--speculative-ngram-branch-length` | `int` | `18` | Recent tokens to insert into cache |
| `--speculative-ngram-capacity` | `int` | `10,000,000` | Cache capacity |
### Environment variables
| Variable | Default | Description |
|---|---|---|
| `SGLANG_ENABLE_SPEC_V2` | `False` | Enable Speculative Decoding V2 (overlap scheduler) |
| `SGLANG_NGRAM_FORCE_GREEDY_VERIFY` | `False` | Force greedy verification for ngram decoding |
### Other related flags
| Parameter | Description |
|---|---|
| `--enable-multi-layer-eagle` | Enable multi-layer EAGLE (auto-enabled for MiMoV2 and Step3p5 models) |
| `--enable-torch-compile` | Enable `torch.compile` for kernel-level optimizations |
| `--torch-compile-max-bs` | Maximum batch size for `torch.compile` |
---
## OOM Troubleshooting
> [!WARNING]
> **Out of Memory (OOM)?** Speculative decoding may increase GPU memory usage because the draft tree, CUDA graphs, and verification-related buffers consume additional VRAM. If you encounter OOM errors, try the following adjustments.
### Step 1: Lower static memory fraction (most effective)
```bash
--mem-fraction-static 0.5 # when omitted, this value is auto-computed
```
- `--mem-fraction-static` controls the memory budget for model weights + KV cache pool.
- Lowering it directly increases dynamic headroom for activations and CUDA graph buffers.
- If omitted, SGLang auto-estimates this value from other settings, and those auto settings can still be too aggressive for some workloads.
### Step 2: Reduce CUDA graph batch size
```bash
# Fewer CUDA graph captures = less memory reserved
--cuda-graph-max-bs 4 # or even 2 for tight memory situations
```
- If omitted, `--cuda-graph-max-bs` is auto-selected based on GPU memory and TP size, and can be much larger on high-memory GPUs.
### Step 3: Reduce draft tree size
These three parameters directly control how much memory the draft tree consumes:
```bash
# Before (aggressive, high memory)
--speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64
# After (conservative, lower memory)
--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```
### Step 4: Limit concurrent requests
```bash
# Fewer concurrent requests lowers in-flight load and can reduce OOM risk
--max-running-requests 4
```
### Quick OOM recovery recipe
If you're hitting OOM and just want something that works, start with this minimal configuration and scale up:
```bash
python3 -m sglang.launch_server \
--model <your-model> \
--speculative-algorithm EAGLE \
--speculative-draft-model-path <your-draft-model> \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--cuda-graph-max-bs 2 \
--mem-fraction-static 0.5 \
--max-running-requests 4 \
--log-level warning
```
Then gradually increase `--speculative-num-draft-tokens`, `--speculative-eagle-topk`, and `--cuda-graph-max-bs`. Increase `--mem-fraction-static` last, only after the run is stable.
---
## References
EAGLE process is as follows:
- Within EAGLE the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$.
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style—branching out multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter—to ensure a more coherent connection of context, and are given as input again.
- In SGLang's EAGLE-2 implementation, the draft tree is expanded for the configured steps and then reranked to select the top `speculative_num_draft_tokens` final nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner.
This enhances drafting accuracy by operating on features instead of tokens for more regular inputs and by additionally passing tokens from the next timestep to reduce sampling randomness. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
For guidance on how to train your own EAGLE model please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train). For EAGLE-3 training specifically, check out [SpecForge](https://github.com/sgl-project/SpecForge), the SGLang team's training framework designed for EAGLE-3 speculative decoding models with seamless porting to SGLang serving. See the [SpecForge documentation](https://docs.sglang.ai/SpecForge/) and [blog post](https://lmsys.org/blog/2025-07-25-spec-forge) for details.
|