Hellohal2064 posted an update Jan 5
🚀 Excited to share: The vLLM container for NVIDIA DGX Spark!

I've been working on getting vLLM to run natively on the new DGX Spark with its GB10 Blackwell GPU (SM121 architecture). The result? Up to 2.5x faster inference than llama.cpp!

📊 Performance Highlights:
• Qwen3-Coder-30B: 44 tok/s (vs 21 tok/s with llama.cpp)
• Qwen3-Next-80B: 45 tok/s (vs 18 tok/s with llama.cpp)

🔧 Technical Challenges Solved:
• Built PyTorch nightly with CUDA 13.1 + SM121 support
• Patched vLLM for the Blackwell architecture
• Created custom MoE expert configs for GB10
• Implemented a TRITON_ATTN attention-backend workaround (sketch below)
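
For anyone rebuilding this from scratch, here's a rough sketch of the two knobs involved. This is my assumption of the standard PyTorch/vLLM mechanisms, not the exact recipe baked into the image:

# Build PyTorch from source targeting GB10 (compute capability 12.1 = SM121),
# run from a PyTorch checkout with the CUDA 13.1 toolkit installed
export TORCH_CUDA_ARCH_LIST="12.1"
python setup.py develop

# At runtime, force vLLM onto the Triton attention backend
# instead of kernels that don't yet ship SM121 binaries
export VLLM_ATTENTION_BACKEND=TRITON_ATTN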

📦 Available now:
• Docker Hub: docker pull hellohal2064/vllm-dgx-spark-gb10:latest (example run command below)
• HuggingFace: huggingface.co/Hellohal2064/vllm-dgx-spark-gb10
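
To try it, a typical run looks something like this (the port, flags, and the exact model id are illustrative assumptions on my part; check the image's entrypoint for specifics):

docker run --gpus all --ipc=host -p 8000:8000 \
  hellohal2064/vllm-dgx-spark-gb10:latest \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct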

The DGX Spark's 119GB unified memory opens up possibilities for running massive models locally.
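As rough napkin math (my own assumption, not a measured figure), the weights of an 80B-parameter model take about:

80B params x 2 bytes (BF16) ≈ 160 GB, which doesn't fit
80B params x 1 byte (FP8/INT8) ≈ 80 GB, which fits with ~39 GB left for KV cache and activations
80B params x 0.5 bytes (4-bit) ≈ 40 GB, which fits comfortably

Happy to connect with others working on the DGX Spark Blackwell!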

Definitely going to check this out! I've been using llama.cpp on my Spark, but a 2.5x inference speedup is huge. Thanks for sharing!


Please try it out :) Let me know if you run into any problems. I will most likely be uploading a new image sometime this week; I'm working on some other improvements around the Qwen-Next models.

Hi! I'm also a developer working on enterprise solutions with the NVIDIA DGX Spark. I'd love to connect; I have loads of questions, plus some solutions of my own I'd like to workshop with you!


I'm on US Central time (CST). You can reach me at 971-708-9761. My AI system will ask what you're calling about; just say "DGX Spark AI".