[Bug] Eagle V2 speculative decoding crashes with NaN in logits when radix cache prefix hit occurs (SM120 / RTX PRO 6000 Blackwell)
Environment
| Component | Version |
|---|---|
| Container | lmsysorg/sglang:glm5-blackwell (sha256:968b8bc5f67c) |
| Model | festr2/GLM-5-NVFP4-MTP |
| GPU | 8× NVIDIA RTX PRO 6000 Blackwell 96GB (SM120, TP=8) |
| CUDA | 12.9.1 |
| Quantization | modelopt_fp4 |
| KV cache dtype | bfloat16 (auto-set by server_args.py for DSA on SM120) |
| Speculative algorithm | NEXTN (mapped to EAGLE / eagle_worker_v2) |
Launch command
```shell
python3 -m sglang.launch_server \
  --model-path /model \
  --served-model-name GLM-5 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype auto \
  --tensor-parallel-size 8 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --disable-custom-all-reduce \
  --enable-flashinfer-allreduce-fusion \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-num-draft-tokens 1 \
  --speculative-eagle-topk 1 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --mem-fraction-static 0.92 \
  --max-running-requests 8 \
  --enable-nan-detection \
  --watchdog-timeout 600
```
(env: `SGLANG_ENABLE_SPEC_V2=True`, `SGLANG_ENABLE_JIT_DEEPGEMM=0`, `SGLANG_ENABLE_DEEP_GEMM=0`)
Bug description
The server starts successfully, CUDA graphs are captured, and speculative decoding works fine on the first requests (accept rate ~0.85–0.94, ~50 tok/s).
The crash occurs consistently when a subsequent request hits a radix cache prefix (#cached-token > 0). On the first such request, eagle_worker_v2.py detects NaN in the logits and all TP workers crash simultaneously.
Crash traceback
```
[2026-03-03 18:14:33 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File ".../sglang/srt/managers/scheduler.py", line 3076, in run_scheduler_process
    scheduler.event_loop_overlap()
  File ".../scheduler.py", line 1123, in event_loop_overlap
    batch_result = self.run_batch(batch)
  File ".../scheduler.py", line 2279, in run_batch
    batch_result = self.model_worker.forward_batch_generation(...)
  File ".../sglang/srt/speculative/eagle_worker_v2.py", line 675, in forward_batch_generation
    batch_output = self.verify(model_worker_batch)
  File ".../sglang/srt/speculative/eagle_worker_v2.py", line 765, in verify
    detect_nan(logits_output)
  File ".../sglang/srt/speculative/spec_utils.py", line 713, in detect_nan
    raise ValueError("Detected errors during sampling! NaN in the logits.")
ValueError: Detected errors during sampling! NaN in the logits.
```
(all 8 TP workers crash at the same timestamp)
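For reference, the guard that fires here is presumably just a NaN scan over the verify-step logits. A minimal stand-alone sketch of such a check (illustrative reimplementation, not the actual `spec_utils` code):

```python
import math

def detect_nan(logits):
    """Raise if any logit is NaN, mirroring the guard that fires in the
    traceback above (illustrative only, not the sglang source)."""
    if any(math.isnan(x) for row in logits for x in row):
        raise ValueError("Detected errors during sampling! NaN in the logits.")

# Healthy logits pass silently.
detect_nan([[0.1, -2.3], [4.0, 0.0]])

# A single NaN anywhere in the batch trips the guard and takes the
# scheduler down, which is why all TP workers die at the same timestamp.
try:
    detect_nan([[0.1, float("nan")]])
    crashed = False
except ValueError:
    crashed = True
print(crashed)  # True
```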
Reproduction pattern
The crash is 100% reproducible with this sequence:
- Send request A (input ~200 tokens) → completes normally, accept_rate ~0.9
- Send request B (input ~3000 tokens, no cache overlap) → completes normally
- Send request C (input ~4400 tokens, overlaps with previous context, #cached-token=2688) → crash on step 2 of decode
Log lines immediately before the crash:
```
Prefill batch, #new-seq: 1, #new-token: 1792, #cached-token: 2688, ...
Decode batch, #running-req: 1, #token: 4480, accept len: 1.82, accept rate: 0.91, ...
Decode batch, #running-req: 1, #token: 4544, accept len: 1.75, accept rate: 0.88, ...
→ CRASH
```
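The request sequence can be sketched as prompt construction alone; the only property that matters is that C shares a long prefix with B so the radix cache reports #cached-token > 0 (prompts here are hypothetical placeholders, and character overlap stands in for token overlap):

```python
# Build three prompts mirroring the repro: A is short and independent,
# B is long, and C extends B's context so its prefix hits the radix cache.
base_context = "system prompt plus long shared conversation history " * 60
prompt_a = "tell me a joke"
prompt_b = base_context + "first question"
prompt_c = prompt_b + " follow-up question"

def shared_prefix_len(x, y):
    """Length of the common prefix (a rough proxy for #cached-token)."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n

# C overlaps B's entire prompt, so a radix-cache lookup for C reuses B's
# KV entries; that is the condition under which the crash occurs.
print(shared_prefix_len(prompt_c, prompt_b) == len(prompt_b))  # True
print(shared_prefix_len(prompt_a, prompt_b))                   # 0
```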
Key observation
Without speculative decoding (same hardware, same model, same NSA patches) the server is fully stable:
- Handled 15,000+ token contexts without issues
- No NaN, no crashes across dozens of requests
This strongly suggests the bug is in the Eagle V2 verify path when processing a batch whose KV cache was partially populated from a radix cache prefix, not in the base model or the attention backend.
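To make the suspected condition concrete: on a prefix hit, only the suffix of the request is prefilled, and the verify step then runs against KV entries written by an earlier request. A toy model of that split, using the token counts from the log above (illustrative only; these are not sglang internals):

```python
def split_by_cached_prefix(new_tokens, cached_tokens):
    """Return (reused, to_prefill): the longest common prefix is served
    from the radix cache, the remainder is prefilled fresh."""
    n = 0
    while (n < min(len(new_tokens), len(cached_tokens))
           and new_tokens[n] == cached_tokens[n]):
        n += 1
    return new_tokens[:n], new_tokens[n:]

cached = list(range(2688))                                    # tokens already in the radix cache
request_c = list(range(2688)) + list(range(10_000, 11_792))   # 4480-token request C

reused, fresh = split_by_cached_prefix(request_c, cached)
print(len(reused))   # 2688 -> matches "#cached-token: 2688" in the log
print(len(fresh))    # 1792 -> matches "#new-token: 1792"
```

The verify step therefore sees a KV cache where 2688 positions came from a previous request's prefill, which is exactly the case that never arises when speculative decoding is disabled or when requests have no cache overlap.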
Additional context
- SM120 (Blackwell) requires several patches to run GLM-5-NVFP4-MTP:
  - KV cache dtype forced to bfloat16 (fp8_e4m3 unsupported for DSA on SM120)
  - NSA backends patched from `flashmla_sparse`/`trtllm` → `flashinfer` (the SM90/SM100-only backends produce NaN on SM120)
  - DeepGemm disabled (`SGLANG_ENABLE_JIT_DEEPGEMM=0`, `SGLANG_ENABLE_DEEP_GEMM=0`): a separate numerical-instability issue on Blackwell
- `speculative_algorithm` is set to `NEXTN` on the CLI but gets mapped to `EAGLE` internally with `draft_model_path = target_model_path` (the MTP heads are reused)
- `SGLANG_ENABLE_SPEC_V2=True` is required; otherwise NEXTN silently falls back to non-v2 Eagle and loads the full model twice → OOM
Workaround
Disabling speculative decoding entirely resolves the crash. Running without MTP: ~33 tok/s. With MTP (before the crash): ~50 tok/s, accept_rate 0.80–0.94.
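For context on the numbers: the measured ~1.5x speedup is consistent with the reported acceptance length, with the gap attributable to per-step draft and verify overhead (back-of-envelope only, using the figures from this report):

```python
base_tps = 33.0     # tok/s without MTP (reported above)
spec_tps = 50.0     # tok/s with MTP, before the crash (reported above)
accept_len = 1.82   # avg tokens emitted per verify step (from the logs)

# The ideal speedup equals tokens emitted per step; the shortfall versus
# the measured ratio is the cost of drafting and verification.
measured_speedup = spec_tps / base_tps
print(round(measured_speedup, 2))                # 1.52
print(round(accept_len - measured_speedup, 2))   # shortfall vs. ideal
```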
Would be happy to provide full logs or test patches. This hardware (SM120) is fully available for testing.
Hello, would you join our Discord: https://discord.gg/FJye6yaWN3

I'm curious about:
- "KV cache dtype forced to bfloat16 (fp8_e4m3 unsupported for DSA on SM120)": what patch?
- "NSA backends patched from `flashmla_sparse`/`trtllm` → `flashinfer`": what patch?
- Why don't you use the default recommended MTP settings: `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`?
- Why `SGLANG_ENABLE_SPEC_V2=True`? Is it stable enough? I'm running without it.

Let's discuss on Discord.