Crash with NVFP4 GLM 5.2 weights

by giorgiopiatti-dfinity - opened 2 days ago

Environment & config: We deployed the Red Hat DSpark recipe on a 4-GPU (TP=4) shadow canary running vLLM nightly-2dfaae752b4db0d43cfc0715c780e33be030d0f1 (v0.23.1rc1.dev748). The speculative config matches the HF card: {"model": "RedHatAI/GLM-5.2-speculator.dspark", "num_speculative_tokens": 7, "method": "dspark", "draft_sample_method": "probabilistic"}. We kept NVFP4 main weights (eigen-ai-labs/GLM-5.2-NVFP4) instead of FP8, and omitted --kv-cache-dtype fp8_e4m3, so the KV cache stays at the default bf16. Startup progressed normally through main model load (~108 GiB), DSpark speculator download and load (Eagle3 aux layers 8, 23, 39, 55, 70), dspark_head torch.compile, KV cache allocation (513K tokens, bf16), and FlashInfer autotune, with query, decode, and KV dtypes all aligned to bfloat16.
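For anyone trying to reproduce this, here is a minimal sketch of how the settings above could be assembled into a launch command. It is a reconstruction, not the exact invocation from the deployment: the model names and speculative config are copied from this post, --tensor-parallel-size, --speculative-config, and --kv-cache-dtype are standard vLLM CLI options, and the helper name build_serve_command is hypothetical. Any other flags the recipe may require are omitted.

```python
import json
import shlex

# Main weights and speculative config as described in the post.
MAIN_MODEL = "eigen-ai-labs/GLM-5.2-NVFP4"

SPECULATIVE_CONFIG = {
    "model": "RedHatAI/GLM-5.2-speculator.dspark",
    "num_speculative_tokens": 7,
    "method": "dspark",
    "draft_sample_method": "probabilistic",
}


def build_serve_command(kv_cache_dtype=None):
    """Return the argv list for a `vllm serve` launch (hypothetical helper).

    kv_cache_dtype=None leaves the flag out entirely, so the KV cache
    keeps its default dtype, which is the setup described in the post.
    """
    cmd = [
        "vllm", "serve", MAIN_MODEL,
        "--tensor-parallel-size", "4",
        "--speculative-config", json.dumps(SPECULATIVE_CONFIG),
    ]
    if kv_cache_dtype is not None:
        cmd += ["--kv-cache-dtype", kv_cache_dtype]
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line; nothing is launched here.
    print(shlex.join(build_serve_command()))
```

Calling build_serve_command("fp8_e4m3") appends the KV cache flag, which corresponds to the variant that hit the separate dtype mismatch.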

Failure: The pod consistently crashes during final kernel warmup, in compile_or_warm_up_model → warmup_kernels → expand_idx_mapping → Triton _expand_idx_mapping_kernel, with "CUDA error: an illegal memory access was encountered" (cudaErrorIllegalAddress). We saw the same stack on an earlier nightly (09663abde, dev714), after we had fixed a separate issue in which --kv-cache-dtype fp8_e4m3 caused a dspark CUDA graph dtype mismatch (expected bfloat16, got float8_e4m3fn). The Jul 3 nightly did not fix the warmup crash. Red Hat’s validated setup uses zai-org/GLM-5.2-FP8 on v0.23.1rc1.dev709; we have not yet retried with FP8 weights. Happy to share full pod logs or open a vLLM upstream issue if useful.
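One caveat on the stack above: CUDA reports kernel errors asynchronously, so the frame that raises may not be the kernel that actually faulted. A minimal sketch of a debug environment for the next repro follows, assuming the server is started from a Python wrapper. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; the helper name debug_env is hypothetical, and whether the stack changes under synchronous launches is untested here.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches.

    With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until the
    kernel finishes, so an illegal memory access is reported at the
    launch that caused it rather than at a later API call.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    # Show the variable that would be passed to the server process,
    # e.g. via subprocess.run(cmd, env=debug_env()).
    print(debug_env()["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow startup considerably, so this is only meant for a one-off repro, not for the canary itself.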
