# vLLM for DGX Spark (Blackwell GB10)

Optimized vLLM Docker image for running Nemotron3-Nano and other models on NVIDIA DGX Spark with CUDA graphs enabled.
## Credits

- Model: cybermotaz/nemotron3-nano-nvfp4-w4a16 - NVFP4 quantization by @cybermotaz
- Original model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 by NVIDIA
- This Docker image (avarok/vllm-dgx-spark): resolves DGX Spark (GB10/SM12.1) build and runtime issues
## Performance

| Mode | Throughput |
|---|---|
| Eager mode (`--enforce-eager`) | ~42 tok/s |
| CUDA graphs enabled | ~66-67 tok/s |

~60% speedup with CUDA graphs on DGX Spark GB10!
## Quick Start (One-Liner)

```bash
docker run --rm -it --gpus all --ipc=host -p 8000:8000 -e VLLM_FLASHINFER_MOE_BACKEND=latency -v ~/.cache/huggingface:/root/.cache/huggingface avarok/vllm-dgx-spark:v11 serve cybermotaz/nemotron3-nano-nvfp4-w4a16 --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 131072 --gpu-memory-utilization 0.85
```

Then test with:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"cybermotaz/nemotron3-nano-nvfp4-w4a16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
```
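The same endpoint also accepts the standard OpenAI-style `stream` parameter if you want tokens as they are generated rather than one final response; a minimal sketch against the server started above:

```shell
# Stream the completion token-by-token (server-sent events).
# Assumes the Quick Start container is running on localhost:8000.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"cybermotaz/nemotron3-nano-nvfp4-w4a16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100,"stream":true}'
```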
## What This Image Fixes

This image solves several compatibility issues when running vLLM on DGX Spark (Blackwell GB10, SM12.1):

| Issue | Solution |
|---|---|
| Non-gated activations (ReLU²) not supported | Built from vLLM main branch with PR #29004 |
| CUDA architecture mismatch | Built with `TORCH_CUDA_ARCH_LIST="12.1f"` for GB10 |
| SM120 CUTLASS kernel failures | Uses `VLLM_FLASHINFER_MOE_BACKEND=latency` |
| FP4/scaled_mm kernel issues | CMakeLists patch to restrict to SM10.0 |
| CUDA 13.0 compatibility | Full CUDA 13.0 + PyTorch cu130 support |
## Docker Image

```bash
docker pull avarok/vllm-dgx-spark:v11
```

Image size: ~27GB
## Building From Source

If you prefer to build the image yourself:

```bash
git clone https://huggingface.co/avarok/vllm-dgx-spark
cd vllm-dgx-spark
docker build -t vllm-dgx-spark:v11 .
```

Build time: ~45-60 minutes on DGX Spark
## Environment Variables

| Variable | Value | Description |
|---|---|---|
| `VLLM_FLASHINFER_MOE_BACKEND` | `latency` | Required for SM12.1 compatibility |
| `VLLM_USE_V1` | `1` (default) | Use V1 engine |
| `VLLM_ATTENTION_BACKEND` | `FLASHINFER` (default) | FlashInfer attention |
| `VLLM_CUDA_GRAPH_MODE` | `full_and_piecewise` (default) | CUDA graph mode |
## Full Run Command

```bash
docker run -d --name vllm-nemotron \
  --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/vllm-dgx-spark:v11 \
  serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser deepseek_r1
```
## Startup Time

First startup takes ~8-10 minutes due to:

- torch.compile (~5 min)
- FlashInfer autotuning (~2 min)
- CUDA graph capture (~1 min)
Subsequent startups with cached compilation are faster.
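Because the first startup is slow, it can help to wait on vLLM's `/health` endpoint before sending requests; a minimal sketch, assuming the server is mapped to localhost:8000 as in the commands above:

```shell
# Poll the vLLM health endpoint until the server is ready to accept requests.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "vLLM still starting..."
  sleep 15
done
echo "vLLM is ready"
```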
## Hardware Requirements

- NVIDIA DGX Spark with GB10 GPU (SM12.1, Blackwell architecture)
- 128GB unified memory
- CUDA 13.0+
## Troubleshooting

### "Failed to initialize cutlass TMA WS grouped gemm"

Make sure you're using `-e VLLM_FLASHINFER_MOE_BACKEND=latency`. The `throughput` backend has SM120 kernel issues on SM12.1.
### Memory errors

Reduce `--gpu-memory-utilization` to 0.75 or lower, or reduce `--max-model-len`.
### Slow performance (~42 tok/s instead of ~67 tok/s)

Check that CUDA graphs are enabled (no `--enforce-eager` flag) and that startup completed successfully. Look for "Capturing CUDA graphs" in the logs.
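One quick way to check is to grep the container logs for graph-capture messages; this sketch assumes the container name `vllm-nemotron` from the Full Run Command above:

```shell
# Search the server logs for CUDA graph capture activity.
docker logs vllm-nemotron 2>&1 | grep -i "cuda graph"
```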
## Files in This Repo

- `Dockerfile` - Reproducible build for vLLM on DGX Spark
- `vllm_cmakelists.patch` - Patch for SM12.x kernel compatibility
- `README.md` - This file
## License

Apache 2.0