vLLM for DGX Spark (Blackwell GB10)

Optimized vLLM Docker image for running Nemotron3-Nano and other models on NVIDIA DGX Spark with CUDA graphs enabled.

Credits

Performance

Mode                          Throughput
Eager mode (--enforce-eager)  ~42 tok/s
CUDA graphs enabled           ~66-67 tok/s

~60% speedup with CUDA graphs on DGX Spark GB10!

Quick Start (One-Liner)

docker run --rm -it --gpus all --ipc=host -p 8000:8000 -e VLLM_FLASHINFER_MOE_BACKEND=latency -v ~/.cache/huggingface:/root/.cache/huggingface avarok/vllm-dgx-spark:v11 serve cybermotaz/nemotron3-nano-nvfp4-w4a16 --quantization modelopt_fp4 --kv-cache-dtype fp8 --trust-remote-code --max-model-len 131072 --gpu-memory-utilization 0.85

Then test with:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"cybermotaz/nemotron3-nano-nvfp4-w4a16","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
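
You can also confirm the model is registered by listing the served models via the standard OpenAI-compatible endpoint that vLLM exposes:

curl http://localhost:8000/v1/models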

What This Image Fixes

This image solves several compatibility issues when running vLLM on DGX Spark (Blackwell GB10, SM12.1):

Issue                                        Solution
Non-gated activations (ReLU²) not supported  Built from vLLM main branch with PR #29004
CUDA architecture mismatch                   Built with TORCH_CUDA_ARCH_LIST="12.1f" for GB10
SM120 CUTLASS kernel failures                Uses VLLM_FLASHINFER_MOE_BACKEND=latency
FP4/scaled_mm kernel issues                  CMakeLists patch to restrict to SM10.0
CUDA 13.0 compatibility                      Full CUDA 13.0 + PyTorch cu130 support
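
If you want to sanity-check that the container sees the GPU as SM12.1, one quick way (assuming python3 is on the image's PATH and overriding the image's default vllm entrypoint) is:

docker run --rm --gpus all --entrypoint python3 avarok/vllm-dgx-spark:v11 -c "import torch; print(torch.cuda.get_device_capability())"

On GB10 this should print (12, 1).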

Docker Image

docker pull avarok/vllm-dgx-spark:v11

Image size: ~27GB
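
After pulling, you can verify the tag and its local size with:

docker images avarok/vllm-dgx-spark:v11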

Building From Source

If you prefer to build the image yourself:

git clone https://huggingface.co/avarok/vllm-dgx-spark
cd vllm-dgx-spark
docker build -t vllm-dgx-spark:v11 .

Build time: ~45-60 minutes on DGX Spark
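
Because the build is long, it can help to keep plain-text output and a log you can revisit if a step fails (standard BuildKit/docker flags, nothing specific to this image):

docker build --progress=plain -t vllm-dgx-spark:v11 . 2>&1 | tee build.log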

Environment Variables

Variable                     Value                         Description
VLLM_FLASHINFER_MOE_BACKEND  latency                       Required for SM12.1 compatibility
VLLM_USE_V1                  1 (default)                   Use V1 engine
VLLM_ATTENTION_BACKEND       FLASHINFER (default)          FlashInfer attention
VLLM_CUDA_GRAPH_MODE         full_and_piecewise (default)  CUDA graph mode

Full Run Command

docker run -d --name vllm-nemotron \
  --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/vllm-dgx-spark:v11 \
  serve cybermotaz/nemotron3-nano-nvfp4-w4a16 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser deepseek_r1
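
The container runs detached (-d), so follow the startup logs to watch compilation and CUDA graph capture:

docker logs -f vllm-nemotron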

Startup Time

First startup takes ~8-10 minutes due to:

  • torch.compile (~5 min)
  • FlashInfer autotuning (~2 min)
  • CUDA graph capture (~1 min)

Subsequent startups with cached compilation are faster.
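
If you recreate the container (rather than just restarting it), the compilation cache is lost unless it is persisted. Assuming vLLM's default cache location of /root/.cache/vllm inside the container, adding one more volume to the run command above should carry it over:

-v ~/.cache/vllm:/root/.cache/vllm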

Hardware Requirements

  • NVIDIA DGX Spark with GB10 GPU (SM12.1, Blackwell architecture)
  • 128GB unified memory
  • CUDA 13.0+
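
To confirm the host GPU, compute capability, and driver before starting the container (the compute_cap query requires a reasonably recent driver):

nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

On DGX Spark this should report compute capability 12.1 for the GB10.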

Troubleshooting

"Failed to initialize cutlass TMA WS grouped gemm"

Make sure you're using -e VLLM_FLASHINFER_MOE_BACKEND=latency. The default throughput backend relies on SM120 CUTLASS kernels that fail on SM12.1.
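
You can verify the variable is actually set inside a running container with:

docker exec vllm-nemotron env | grep VLLM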

Memory errors

Reduce --gpu-memory-utilization to 0.75 or lower, or reduce --max-model-len.
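
For example, a more conservative configuration (the 64K context length here is just an illustrative value) would change the serve arguments in the run command above to:

--max-model-len 65536 --gpu-memory-utilization 0.75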

Slow performance (~42 tok/s instead of ~67 tok/s)

Check that CUDA graphs are enabled (no --enforce-eager flag) and startup completed successfully. Look for "Capturing CUDA graphs" in the logs.
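
With the container from the full run command above, you can check the logs for the capture step, e.g.:

docker logs vllm-nemotron 2>&1 | grep -i "cuda graph"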

Files in This Repo

  • Dockerfile - Reproducible build for vLLM on DGX Spark
  • vllm_cmakelists.patch - Patch for SM12.x kernel compatibility
  • README.md - This file

License

Apache 2.0
