PyTorch 2.12.0a0 - Custom Build for NVIDIA GB10 (sm_121 / CUDA 13.0)

Custom PyTorch wheel with native sm_121 CUDA support for the NVIDIA GB10 GPU (DGX Spark, Blackwell architecture).

Why?

Pre-built PyTorch wheels from PyPI (both the cu128 stable and cu130 nightly builds) ship CUDA kernels compiled for architectures up to sm_120. The GB10 reports compute capability sm_121, which is not included. Simple CUDA operations still work via PTX forward compatibility, but complex model inference (notably diffusion sampling in tools like Boltz-2) hangs indefinitely during PTX JIT compilation.

Install

pip install https://huggingface.co/Qanatpharma/pytorch-sm121-gb10/resolve/main/torch-2.12.0a0+gitb071fd7-cp312-cp312-linux_aarch64.whl

Or download and install locally:

pip install torch-2.12.0a0+gitb071fd7-cp312-cp312-linux_aarch64.whl --force-reinstall --no-deps

Required runtime environment

# Triton's bundled ptxas doesn't support sm_121a; use the system one
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Enable Tensor Core matmul for better performance
export TORCH_FLOAT32_MATMUL_PRECISION=high
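If you prefer configuring this from Python rather than the shell, the same environment can be set before torch and Triton initialize. A sketch, assuming the same paths and values as the exports above:

```python
# Set the required runtime environment from Python. Both variables must be
# in place before torch/triton compile anything, so do this at the very top
# of your entry-point script.
import os

os.environ.setdefault("TRITON_PTXAS_PATH", "/usr/local/cuda/bin/ptxas")
os.environ.setdefault("TORCH_FLOAT32_MATMUL_PRECISION", "high")

# The matmul precision can also be set in-process once torch is imported:
#   torch.set_float32_matmul_precision("high")
```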

For Boltz-2 specifically

pip install cuequivariance-torch cuequivariance-ops-torch-cu12 --extra-index-url https://pypi.nvidia.com

Build Details

| Field | Value |
|---|---|
| PyTorch version | 2.12.0a0 (main branch, commit b071fd7) |
| CUDA | 13.0 |
| CUDA arch | sm_121 only |
| Python | 3.12 |
| Platform | linux_aarch64 (ARM64) |
| Build flags | USE_CUDA=1 USE_CUDNN=1 USE_NCCL=0 USE_DISTRIBUTED=0 |
| Build time | ~2 hours on DGX Spark (20 ARM cores) |
| Wheel size | 168 MB |

Verification

import torch
props = torch.cuda.get_device_properties(0)
cc = f"sm_{props.major}{props.minor}"
print(f"Device: {props.name}")                      # NVIDIA GB10
print(f"Compute capability: {cc}")                  # sm_121
print(f"PyTorch archs: {torch.cuda.get_arch_list()}")  # ['sm_121']
print(f"Match: {cc in torch.cuda.get_arch_list()}")    # True
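For scripting, the membership check above can be factored into a small standalone helper. A minimal sketch (`arch_matches` is an illustrative name, not a torch API; `arch_list` mirrors what `torch.cuda.get_arch_list()` returns):

```python
# Hypothetical helper: does the device's compute capability have natively
# compiled kernels in the wheel's arch list?
def arch_matches(major: int, minor: int, arch_list: list[str]) -> bool:
    return f"sm_{major}{minor}" in arch_list

# GB10 reports capability (12, 1); this wheel compiles sm_121 only.
print(arch_matches(12, 1, ["sm_121"]))  # True
print(arch_matches(12, 0, ["sm_121"]))  # False: no native kernels
```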

Benchmark: Boltz-2 HP35 Prediction

| Metric | Pre-built torch (hangs) | NIM container | This wheel |
|---|---|---|---|
| Time | ∞ | 4.1s | 7s |
| GPU utilization | 0% | 76% | 81% |
| Peak power | 9W (idle) | 21W | 38.7W |
| Confidence | N/A | 0.93 | 0.93 |

Limitations

  • Built for sm_121 only; it will not run on other GPU architectures
  • No NCCL / distributed support; single-GPU only (appropriate for DGX Spark)
  • Python 3.12 only; rebuild for other Python versions
  • Flash-attention Hopper (sm_90) instantiations are compiled but unused
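To surface the Python 3.12 constraint early, a fail-fast guard can help; a hedged sketch (`check_python` is illustrative and not shipped with the wheel):

```python
# Hypothetical guard: raise a clear error instead of a cryptic pip/import
# failure when running on the wrong interpreter (this wheel is cp312-only).
import sys

def check_python(version_info=sys.version_info) -> None:
    if tuple(version_info[:2]) != (3, 12):
        raise RuntimeError(
            f"This wheel targets Python 3.12; "
            f"found {version_info[0]}.{version_info[1]}"
        )

check_python((3, 12))  # passes silently; raises RuntimeError otherwise
```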

How to Rebuild

git clone --recursive --depth 1 https://github.com/pytorch/pytorch
cd pytorch
python -m venv .venv && source .venv/bin/activate
pip install pyyaml typing-extensions numpy cmake ninja setuptools wheel

TORCH_CUDA_ARCH_LIST="12.1" \
USE_CUDA=1 USE_CUDNN=1 USE_NCCL=0 USE_DISTRIBUTED=0 USE_MPI=0 \
MAX_JOBS=$(nproc) BUILD_TEST=0 \
python setup.py bdist_wheel
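Note that TORCH_CUDA_ARCH_LIST takes dotted compute capabilities ("12.1"), while the compiled arch names drop the dot ("sm_121", as reported by `torch.cuda.get_arch_list()`). A small illustrative converter (the helper name is an assumption, not a torch API):

```python
# Map a dotted compute capability ("12.1") to the cubin arch name ("sm_121").
def cap_to_arch(cap: str) -> str:
    major, minor = cap.split(".")
    return f"sm_{major}{minor}"

print(cap_to_arch("12.1"))  # sm_121
```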
