JustinTX
/

shinka-backup

Model card Files Files and versions

shinka-backup / ccevolve /baselines /thetaevolve /docs /en /advanced /speculative-decoding.md

JustinTX's picture

Add files using upload-large-folder tool

d7b3a74 verified about 1 month ago

|

history blame contribute delete

2.63 kB

	# Speculative Decoding

	Speculative decoding is an important optimization for making faster rollout during RL training. Currently slime only supports speculative decoding without training.

	For model with MTP layer (e.g. GLM-4.6, Deepseek-V3/R1), you can run with:

	```bash
	--sglang-speculative-algorithm EAGLE
	--sglang-speculative-num-steps 3
	--sglang-speculative-eagle-topk 1
	--sglang-speculative-num-draft-tokens 4
	```

	And for external draft model (e.g. draft models from [SpecForge](https://docs.sglang.ai/SpecForge/)), you need also pass:

	```bash
	--speculative-draft-model-path /your/draft/model/path
	```

	For details on parameter meanings and configuration, see the [SGLang speculative decoding documentation](https://docs.sglang.ai/advanced_features/speculative_decoding.html).

	### Known Issues

	#### [SGLang issue #9888](https://github.com/sgl-project/sglang/issues/9888) or [SGLang issue #9521](https://github.com/sgl-project/sglang/issues/9521)

	* Error occurs during CUDA graph padding in the speculative decoding draft stage.
	* Workarounds:

	1. Switch the inference backend to fa3 Triton (bug only occurs in flashInfer).
	2. Specify a broader range for `--sglang-cuda-graph-bs` to avoid batch sizes that trigger CUDA graph padding.
	3. Disable CUDA graph (not recommended due to significant performance loss).
	4. Notice: Disabling CUDA graph padding with `--sglang-disable-cuda-graph-padding` is currently ineffective for speculative decoding. See [SGLang `cuda_graph_runner.py`](tbd).
	* For debugging, enable slime’s `--debug-rollout-only` flag to isolate rollout behavior from parameter updates or model offloading.

	```bash
	# If speculative decoding fails, this can help debug
	--debug-rollout-only

	# If flashInfer causes issues with speculative decoding, use fa3 or triton instead
	--sglang-attention-backend fa3

	# If CUDA graph fails due to padding, extend the CUDA graph batch size
	--sglang-cuda-graph-bs $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160)

	# Improve performance by enlarging the running batch size limit
	--sglang-max-running-requests 128
	```

	#### [SGLang issue #9481](https://github.com/sgl-project/sglang/issues/9481)

	* Solution:

	1. Apply the latest SGLang patch.
	2. See [PR #9687](https://github.com/sgl-project/sglang/pull/9687) for reference changes.

	#### [SGLang PR #9388](https://github.com/sgl-project/sglang/pull/9388)

	* If using an external draft model results in illegal memory access, it may be caused by a context length mismatch between the draft and target models.
	* Please update to SGLang ≥ 0.5.1 (and update `sgl-kernel`) to apply this fix.