vLLM for DGX Spark (Blackwell GB10)

Optimized vLLM Docker image for running Nemotron3-Nano and other models on NVIDIA DGX Spark with CUDA graphs enabled.

Credits

Performance

Mode                          Throughput
Eager mode (--enforce-eager)  ~42 tok/s
CUDA graphs enabled           ~66-67 tok/s

~60% speedup with CUDA graphs on DGX Spark GB10!

Quick Start (One-Liner)

docker run --rm -it --gpus all --ipc=host -p 8000:8000 -e VLLM_FLASHINFER_MOE_BACKEND=latency -v ~/.cache/huggingface:/root/.cache/huggingface avarok/vllm-dgx-spark:v11 serve cybermotaz/nemotron3-nano-nvfp4-w4a16 --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 131072 --gpu-memory-utilization 0.85

Then test with:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"cybermotaz/nemotron3-nano-nvfp4-w4a16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
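
You can also confirm the model is registered by listing the served models via the standard OpenAI-compatible endpoint that vLLM exposes:

curl http://localhost:8000/v1/models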

What This Image Fixes

This image solves several compatibility issues when running vLLM on DGX Spark (Blackwell GB10, SM12.1):

Issue                                        Solution
Non-gated activations (ReLU²) not supported  Built from vLLM main branch with PR #29004
CUDA architecture mismatch                   Built with TORCH_CUDA_ARCH_LIST="12.1f" for GB10
SM120 CUTLASS kernel failures                Uses VLLM_FLASHINFER_MOE_BACKEND=latency
FP4/scaled_mm kernel issues                  CMakeLists patch to restrict to SM10.0
CUDA 13.0 compatibility                      Full CUDA 13.0 + PyTorch cu130 support
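
If you want to sanity-check that the container sees the GPU as SM12.1, one quick way (assuming python3 is on the image's PATH and overriding the image's default vllm entrypoint) is:

docker run --rm --gpus all --entrypoint python3 avarok/vllm-dgx-spark:v11 -c "import torch; print(torch.cuda.get_device_capability())"

On GB10 this should print (12, 1).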

Docker Image

docker pull avarok/vllm-dgx-spark:v11

Image size: ~27GB
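
After pulling, you can verify the tag and its local size with:

docker images avarok/vllm-dgx-spark:v11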

Building From Source

If you prefer to build the image yourself:

git clone https://huggingface.co/avarok/vllm-dgx-spark
cd vllm-dgx-spark
docker build -t vllm-dgx-spark:v11 .

Build time: ~45-60 minutes on DGX Spark
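
Because the build is long, it can help to keep plain-text output and a log you can revisit if a step fails (standard BuildKit/docker flags, nothing specific to this image):

docker build --progress=plain -t vllm-dgx-spark:v11 . 2>&1 | tee build.log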

Environment Variables

Variable                     Value                         Description
VLLM_FLASHINFER_MOE_BACKEND  latency                       Required for SM12.1 compatibility
VLLM_USE_V1                  1 (default)                   Use V1 engine
VLLM_ATTENTION_BACKEND       FLASHINFER (default)          FlashInfer attention
VLLM_CUDA_GRAPH_MODE         full_and_piecewise (default)  CUDA graph mode

Full Run Command

docker run -d --name vllm-nemotron \
  --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/vllm-dgx-spark:v11 \
  serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser deepseek_r1
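
The container runs detached (-d), so follow the startup logs to watch compilation and CUDA graph capture:

docker logs -f vllm-nemotron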

Startup Time

First startup takes ~8-10 minutes due to:

  • torch.compile (~5 min)
  • FlashInfer autotuning (~2 min)
  • CUDA graph capture (~1 min)

Subsequent startups with cached compilation are faster.
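
If you recreate the container (rather than just restarting it), the compilation cache is lost unless it is persisted. Assuming vLLM's default cache location of /root/.cache/vllm inside the container, adding one more volume to the run command above should carry it over:

-v ~/.cache/vllm:/root/.cache/vllm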

Hardware Requirements

  • NVIDIA DGX Spark with GB10 GPU (SM12.1, Blackwell architecture)
  • 128GB unified memory
  • CUDA 13.0+
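
To confirm the host GPU, compute capability, and driver before starting the container (the compute_cap query requires a reasonably recent driver):

nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

On DGX Spark this should report compute capability 12.1 for the GB10.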

Troubleshooting

"Failed to initialize cutlass TMA WS grouped gemm"

Make sure you're using -e VLLM_FLASHINFER_MOE_BACKEND=latency. The default throughput backend relies on SM120 CUTLASS kernels that fail on SM12.1.
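
You can verify the variable is actually set inside a running container with:

docker exec vllm-nemotron env | grep VLLM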

Memory errors

Reduce --gpu-memory-utilization to 0.75 or lower, or reduce --max-model-len.
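
For example, a more conservative configuration (the 64K context length here is just an illustrative value) would change the serve arguments in the run command above to:

--max-model-len 65536 --gpu-memory-utilization 0.75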

Slow performance (~42 tok/s instead of ~67 tok/s)

Check that CUDA graphs are enabled (no --enforce-eager flag) and startup completed successfully. Look for "Capturing CUDA graphs" in the logs.
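
With the container from the full run command above, you can check the logs for the capture step, e.g.:

docker logs vllm-nemotron 2>&1 | grep -i "cuda graph"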

Files in This Repo

  • Dockerfile - Reproducible build for vLLM on DGX Spark
  • vllm_cmakelists.patch - Patch for SM12.x kernel compatibility
  • README.md - This file

License

Apache 2.0
