shinka-backup / ccevolve /baselines /thetaevolve /docs /en /advanced /speculative-decoding.md
JustinTX's picture
Add files using upload-large-folder tool
d7b3a74 verified

Speculative Decoding

Speculative decoding is an important optimization for making faster rollout during RL training. Currently slime only supports speculative decoding without training.

For model with MTP layer (e.g. GLM-4.6, Deepseek-V3/R1), you can run with:

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4

And for external draft model (e.g. draft models from SpecForge), you need also pass:

--speculative-draft-model-path /your/draft/model/path

For details on parameter meanings and configuration, see the SGLang speculative decoding documentation.

Known Issues

SGLang issue #9888 or SGLang issue #9521

  • Error occurs during CUDA graph padding in the speculative decoding draft stage.

  • Workarounds:

    1. Switch the inference backend to fa3 Triton (bug only occurs in flashInfer).
    2. Specify a broader range for --sglang-cuda-graph-bs to avoid batch sizes that trigger CUDA graph padding.
    3. Disable CUDA graph (not recommended due to significant performance loss).
    4. Notice: Disabling CUDA graph padding with --sglang-disable-cuda-graph-padding is currently ineffective for speculative decoding. See SGLang cuda_graph_runner.py.
  • For debugging, enable slime’s --debug-rollout-only flag to isolate rollout behavior from parameter updates or model offloading.

# If speculative decoding fails, this can help debug
--debug-rollout-only

# If flashInfer causes issues with speculative decoding, use fa3 or triton instead
--sglang-attention-backend fa3

# If CUDA graph fails due to padding, extend the CUDA graph batch size
--sglang-cuda-graph-bs $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160)

# Improve performance by enlarging the running batch size limit
--sglang-max-running-requests 128

SGLang issue #9481

  • Solution:

    1. Apply the latest SGLang patch.
    2. See PR #9687 for reference changes.

SGLang PR #9388

  • If using an external draft model results in illegal memory access, it may be caused by a context length mismatch between the draft and target models.
  • Please update to SGLang ≥ 0.5.1 (and update sgl-kernel) to apply this fix.