# Speculative Decoding


Speculative decoding is an important optimization for speeding up rollout during RL training. Currently, slime supports speculative decoding only with a fixed draft model, that is, the draft model is not updated through training.

For models with MTP layers (for example, GLM-4.6 and DeepSeek-V3/R1), just add:

```bash
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
```
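
As a rough sketch of how these values relate: in the example above, the draft-token count is one more than the number of steps, which matches a chain-shaped draft when top-k is 1. The variable names below (`NUM_STEPS`, `TOPK`, `NUM_DRAFT_TOKENS`, `SPEC_ARGS`) are hypothetical and not part of slime; only the four flags come from the text above. The `steps + 1` relation is an assumption that fits top-k 1; other top-k values would need a different draft-token count.

```bash
# Hypothetical variables for illustration; only the flags themselves are from the docs.
NUM_STEPS=3
TOPK=1
# Assumption: with top-k 1 the draft is a single chain, so the draft-token
# count is the number of steps plus one (3 + 1 = 4, as in the example above).
NUM_DRAFT_TOKENS=$((NUM_STEPS + 1))

SPEC_ARGS="--sglang-speculative-algorithm EAGLE \
--sglang-speculative-num-steps ${NUM_STEPS} \
--sglang-speculative-eagle-topk ${TOPK} \
--sglang-speculative-num-draft-tokens ${NUM_DRAFT_TOKENS}"

# Print the assembled flags so they can be pasted into a launch script.
echo "${SPEC_ARGS}"
```

Deriving the last value from `NUM_STEPS` keeps the two settings in sync when you experiment with longer drafts.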

To use a separately trained draft model (for example, one trained with [SpecForge](https://docs.sglang.ai/SpecForge/)), additionally set:

```bash
--sglang-speculative-draft-model-path /your/draft/model/path
```

For the meaning of each parameter and how to configure it, see the SGLang speculative decoding [documentation](https://docs.sglang.ai/advanced_features/speculative_decoding.html).

### Known Issues
[SGLang issue #9888](https://github.com/sgl-project/sglang/issues/9888) or [SGLang issue #9521](https://github.com/sgl-project/sglang/issues/9521)
- The error occurs during CUDA graph padding in the draft stage of speculative decoding.
- Workarounds:
	1. Switch the attention backend to fa3 or triton. The bug only occurs with FlashInfer.
	2. Cover a wider range of batch sizes with `--sglang-cuda-graph-bs` so that fewer batch sizes require CUDA graph padding.
	3. Disable CUDA graph (not recommended, because the performance loss is too large).
	4. Note: disabling CUDA graph padding with `--sglang-disable-cuda-graph-padding` currently has no effect for speculative decoding. See [SGLang cuda_graph_runner.py](tbd).
- To debug, try enabling slime's `--debug-rollout-only` flag to rule out the influence of parameter updates or model offload.
```bash
# If speculative decoding hits a bug, run rollout only to help isolate it
--debug-rollout-only

# If FlashInfer has a bug with speculative decoding, use fa3 or triton instead
--sglang-attention-backend fa3

# If the bug appears when CUDA graph pads a batch, extend the captured batch sizes
--sglang-cuda-graph-bs $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160)

# Improve performance by raising the limit on concurrently running requests
--sglang-max-running-requests 128
```
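
To see what the `seq` expressions in `--sglang-cuda-graph-bs` expand to, the list can be printed on its own. This is only an illustrative sketch; the variable names `BS_LIST`, `BS_COUNT`, and `BS_MAX` are hypothetical and not part of slime or SGLang.

```bash
# Expand the same seq expressions used in the flag above into one space-separated list.
BS_LIST=$(echo $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160))

# Count the entries and pick out the largest (last) one.
BS_COUNT=$(echo "${BS_LIST}" | wc -w | tr -d ' ')
BS_MAX=${BS_LIST##* }

echo "batch sizes: ${BS_LIST}"
echo "count: ${BS_COUNT}, largest: ${BS_MAX}"
```

Every batch size from 1 to 32 appears in the list, while larger sizes are listed in steps of 8 and then 16, so batches that fall between two listed sizes may still be padded.
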
[SGLang issue #9481](https://github.com/sgl-project/sglang/issues/9481)
- Workarounds:
	1. Apply the latest sglang patch.
	2. Modify sglang following this PR: https://github.com/sgl-project/sglang/pull/9687
[SGLang PR #9388](https://github.com/sgl-project/sglang/pull/9388)
- If using an external draft model causes an illegal memory access, it may be due to a bug triggered by a mismatch between the context lengths of the draft model and the target model.
- Update to SGLang >= 0.5.1 (and update `sgl-kernel`) to pick up this PR.
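
A minimal sketch of checking an installed version against the 0.5.1 requirement is shown below. The helper `version_ge` and the variables `REQUIRED` and `INSTALLED` are hypothetical names introduced for illustration, and the script assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Version-sorts the two strings; if $2 comes first (or they are equal), $1 >= $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED="0.5.1"
# Placeholder value: replace with the version reported by your environment.
INSTALLED="0.5.1"

if version_ge "${INSTALLED}" "${REQUIRED}"; then
    echo "SGLang ${INSTALLED} meets the requirement (>= ${REQUIRED})"
else
    echo "SGLang ${INSTALLED} is older than ${REQUIRED}; please upgrade"
fi
```

One way to obtain the installed version is `pip show sglang`; set `INSTALLED` from its output before running the check.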