Hellohal2064 posted an update Jan 5
🚀 Excited to share: The vLLM container for NVIDIA DGX Spark!

I've been working on getting vLLM to run natively on the new DGX Spark with its GB10 Blackwell GPU (SM121 architecture). The result? Up to 2.5x faster inference than llama.cpp!

📊 Performance Highlights:
• Qwen3-Coder-30B: 44 tok/s (vs 21 tok/s with llama.cpp)
• Qwen3-Next-80B: 45 tok/s (vs 18 tok/s with llama.cpp)

🔧 Technical Challenges Solved:
• Built PyTorch nightly with CUDA 13.1 + SM121 support
• Patched vLLM for the Blackwell architecture
• Created custom MoE expert configs for GB10
• Implemented a TRITON_ATTN attention-backend workaround (sketch below)
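
For anyone rebuilding this from scratch, here's a rough sketch of the two knobs involved. This is my assumption of the standard PyTorch/vLLM mechanisms, not the exact recipe baked into the image:

# Build PyTorch from source targeting GB10 (compute capability 12.1 = SM121),
# run from a PyTorch checkout with the CUDA 13.1 toolkit installed
export TORCH_CUDA_ARCH_LIST="12.1"
python setup.py develop

# At runtime, force vLLM onto the Triton attention backend
# instead of kernels that don't yet ship SM121 binaries
export VLLM_ATTENTION_BACKEND=TRITON_ATTN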

📦 Available now:
• Docker Hub: docker pull hellohal2064/vllm-dgx-spark-gb10:latest (example run command below)
• HuggingFace: huggingface.co/Hellohal2064/vllm-dgx-spark-gb10
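
To try it, a typical run looks something like this (the port, flags, and the exact model id are illustrative assumptions on my part; check the image's entrypoint for specifics):

docker run --gpus all --ipc=host -p 8000:8000 \
  hellohal2064/vllm-dgx-spark-gb10:latest \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct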

The DGX Spark's 119GB unified memory opens up possibilities for running massive models locally.
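As rough napkin math (my own assumption, not a measured figure), the weights of an 80B-parameter model take about:

80B params x 2 bytes (BF16) ≈ 160 GB, which doesn't fit
80B params x 1 byte (FP8/INT8) ≈ 80 GB, which fits with ~39 GB left for KV cache and activations
80B params x 0.5 bytes (4-bit) ≈ 40 GB, which fits comfortably

Happy to connect with others working on the DGX Spark Blackwell!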

Definitely going to check this out! I've been using llama.cpp on my Spark, but a 2.5x inference speedup is huge. Thanks for sharing!


Please try it out :) Let me know if you run into any problems. I will most likely be uploading a new image sometime this week; I'm working on some other improvements around the Qwen-Next models.

Hi! I'm also a developer working on enterprise solutions with the NVIDIA DGX Spark. I'd love to connect; I have loads of questions, plus some solutions of my own I'd like to workshop with you!


I'm on US Central time (CST). You can reach me at 971-708-9761. My AI system will ask what you're calling about; just say "DGX Spark AI".