[Bug] Eagle V2 speculative decoding crashes with NaN in logits when radix cache prefix hit occurs (SM120 / RTX PRO 6000 Blackwell)

#2
by repandv - opened

Environment

Component              Version
Container              lmsysorg/sglang:glm5-blackwell (sha256:968b8bc5f67c)
Model                  festr2/GLM-5-NVFP4-MTP
GPU                    8× NVIDIA RTX PRO 6000 Blackwell 96GB (SM120, TP=8)
CUDA                   12.9.1
Quantization           modelopt_fp4
KV cache dtype         bfloat16 (auto-set by server_args.py for DSA on SM120)
Speculative algorithm  NEXTN (mapped to EAGLE / eagle_worker_v2)

Launch command

python3 -m sglang.launch_server \
  --model-path /model \
  --served-model-name GLM-5 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype auto \
  --tensor-parallel-size 8 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --disable-custom-all-reduce \
  --enable-flashinfer-allreduce-fusion \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-num-draft-tokens 1 \
  --speculative-eagle-topk 1 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --mem-fraction-static 0.92 \
  --max-running-requests 8 \
  --enable-nan-detection \
  --watchdog-timeout 600

(env: SGLANG_ENABLE_SPEC_V2=True, SGLANG_ENABLE_JIT_DEEPGEMM=0, SGLANG_ENABLE_DEEP_GEMM=0)

Bug description

The server starts successfully, CUDA graphs are captured, and speculative decoding works fine on the first requests (accept rate ~0.85–0.94, ~50 tok/s).

The crash occurs consistently when a subsequent request hits a radix cache prefix (#cached-token > 0). On the first such request, eagle_worker_v2.py detects NaN in the logits and all TP workers crash simultaneously.

Crash traceback

[2026-03-03 18:14:33 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File ".../sglang/srt/managers/scheduler.py", line 3076, in run_scheduler_process
    scheduler.event_loop_overlap()
  File ".../scheduler.py", line 1123, in event_loop_overlap
    batch_result = self.run_batch(batch)
  File ".../scheduler.py", line 2279, in run_batch
    batch_result = self.model_worker.forward_batch_generation(...)
  File ".../sglang/srt/speculative/eagle_worker_v2.py", line 675, in forward_batch_generation
    batch_output = self.verify(model_worker_batch)
  File ".../sglang/srt/speculative/eagle_worker_v2.py", line 765, in verify
    detect_nan(logits_output)
  File ".../sglang/srt/speculative/spec_utils.py", line 713, in detect_nan
    raise ValueError("Detected errors during sampling! NaN in the logits.")
ValueError: Detected errors during sampling! NaN in the logits.

(all 8 TP workers crash at the same timestamp)
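For reference, the check that fires here is a plain NaN scan over the verify-step logits. A minimal sketch of that guard, illustrated with NumPy rather than the actual torch tensors (`detect_nan_in_logits` is a hypothetical stand-in for sglang's spec_utils.detect_nan, which runs when --enable-nan-detection is set):

```python
import numpy as np

def detect_nan_in_logits(logits: np.ndarray) -> None:
    """Raise if any logit is NaN, mirroring the error message in the traceback.

    Hypothetical stand-in for spec_utils.detect_nan; the real code performs
    the same scan on torch tensors in the Eagle V2 verify path.
    """
    if np.isnan(logits).any():
        raise ValueError("Detected errors during sampling! NaN in the logits.")

# Healthy logits pass silently; a single NaN reproduces the error above.
detect_nan_in_logits(np.array([[0.1, 2.3, -1.0]]))  # no error
try:
    detect_nan_in_logits(np.array([[0.1, float("nan"), -1.0]]))
except ValueError as e:
    print(e)
```

The error itself is only a symptom: the NaN is already present in the logits produced by the verify forward pass, so the interesting question is why that pass goes bad only after a radix cache prefix hit.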

Reproduction pattern

The crash is 100% reproducible with this sequence:

  1. Send request A (input ~200 tokens) → completes normally, accept_rate ~0.9
  2. Send request B (input ~3000 tokens, no cache overlap) → completes normally
  3. Send request C (input ~4400 tokens, overlaps with previous context → #cached-token=2688) → crash on step 2 of decode

Log line immediately before crash:

Prefill batch, #new-seq: 1, #new-token: 1792, #cached-token: 2688, ...
Decode batch, #running-req: 1, #token: 4480, accept len: 1.82, accept rate: 0.91, ...
Decode batch, #running-req: 1, #token: 4544, accept len: 1.75, accept rate: 0.88, ...
→ CRASH
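The three-request sequence above can be scripted against the server's OpenAI-compatible endpoint. A minimal sketch, where the prompt sizes and the shared-prefix construction are illustrative and the base URL is assumed from the launch command's host/port (sending is commented out so the sketch stands alone without a running server):

```python
import json

BASE = "http://localhost:8000/v1/completions"  # assumed from --host/--port

def make_payload(prompt: str, max_tokens: int = 256) -> dict:
    # Served model name matches --served-model-name in the launch command.
    return {"model": "GLM-5", "prompt": prompt, "max_tokens": max_tokens}

# Request A: short, no cache interaction.
a = make_payload("word " * 200)
# Request B: long, fresh prefix.
b = make_payload("alpha " * 3000)
# Request C: reuses B's prefix so the radix cache reports #cached-token > 0,
# the condition that triggers the NaN crash.
c = make_payload("alpha " * 3000 + "beta " * 1400)

for payload in (a, b, c):
    body = json.dumps(payload)
    # e.g. requests.post(BASE, data=body,
    #                    headers={"Content-Type": "application/json"})
```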

Key observation

Without speculative decoding (same hardware, same model, same NSA patches) the server is fully stable:

  • Handled 15,000+ token contexts without issues
  • No NaN, no crashes across dozens of requests

This strongly suggests the bug is in the Eagle V2 verify path when processing a batch whose KV cache was partially populated from a radix cache prefix hit, not in the base model or the attention backend.

Additional context

  • SM120 (Blackwell) requires several patches to run GLM-5-NVFP4-MTP:
    • KV cache dtype forced to bfloat16 (fp8_e4m3 unsupported for DSA on SM12)
    • NSA backends patched from flashmla_sparse/trtllm β†’ flashinfer (SM90/SM100-only backends produce NaN on SM120)
    • DeepGemm disabled (SGLANG_ENABLE_JIT_DEEPGEMM=0, SGLANG_ENABLE_DEEP_GEMM=0) β€” separate numerical instability issue on Blackwell
  • speculative_algorithm is set to NEXTN in CLI but gets mapped to EAGLE internally with draft_model_path = target_model_path (MTP heads reused)
  • SGLANG_ENABLE_SPEC_V2=True is required, otherwise NEXTN falls back silently to non-v2 Eagle and loads the full model twice β†’ OOM

Workaround

Disabling speculative decoding entirely resolves the crash. Running without MTP: ~33 tok/s. With MTP (before crash): ~50 tok/s, accept_rate 0.80–0.94.

Would be happy to provide full logs or test patches. This hardware (SM120) is fully available for testing.


Hello, would you join our Discord: https://discord.gg/FJye6yaWN3

I'm curious about:

KV cache dtype forced to bfloat16 (fp8_e4m3 unsupported for DSA on SM120) - what patch?
NSA backends patched from flashmla_sparse/trtllm to flashinfer - what patch?

  • why you do not use the default recommended mtp: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Why SGLANG_ENABLE_SPEC_V2=True? Is it stable enough? I'm running without it.

Let's discuss on Discord.
