# PyTorch 2.12.0a0 – Custom Build for NVIDIA GB10 (sm_121 / CUDA 13.0)

Custom PyTorch wheel with native sm_121 CUDA support for the NVIDIA GB10 GPU (DGX Spark, Blackwell architecture).
## Why?

Pre-built PyTorch wheels from PyPI (both cu128 stable and cu130 nightly) compile CUDA kernels up to sm_120. The GB10 has compute capability sm_121, which is not included. Simple CUDA operations work via PTX forward compatibility, but complex model inference – particularly diffusion sampling in tools like Boltz-2 – hangs indefinitely during PTX JIT compilation.
## Install

```bash
pip install https://huggingface.co/Qanatpharma/pytorch-sm121-gb10/resolve/main/torch-2.12.0a0+gitb071fd7-cp312-cp312-linux_aarch64.whl
```

Or download and install locally:

```bash
pip install torch-2.12.0a0+gitb071fd7-cp312-cp312-linux_aarch64.whl --force-reinstall --no-deps
```
## Required runtime environment

```bash
# Triton's bundled ptxas doesn't support sm_121a – use the system one
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Enable Tensor Core matmul for better performance
export TORCH_FLOAT32_MATMUL_PRECISION=high
```
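If you launch jobs from a Python entry point rather than a shell, the same environment can be configured programmatically, as long as it happens before `torch` is imported. A sketch (plain assignment here overrides any inherited values):

```python
import os

# Set before importing torch so Triton and torch read them at initialization.
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"
os.environ["TORCH_FLOAT32_MATMUL_PRECISION"] = "high"

# ...only now: import torch
print(os.environ["TRITON_PTXAS_PATH"])
```

For the matmul precision specifically, the in-API equivalent after import is `torch.set_float32_matmul_precision("high")`; the ptxas override, however, must be in the environment before Triton initializes.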
### For Boltz-2 specifically

```bash
pip install cuequivariance-torch cuequivariance-ops-torch-cu12 --extra-index-url https://pypi.nvidia.com
```
## Build Details
| Field | Value |
|---|---|
| PyTorch version | 2.12.0a0 (main branch, commit b071fd7) |
| CUDA | 13.0 |
| CUDA arch | sm_121 only |
| Python | 3.12 |
| Platform | linux_aarch64 (ARM64) |
| Build flags | USE_CUDA=1 USE_CUDNN=1 USE_NCCL=0 USE_DISTRIBUTED=0 |
| Build time | ~2 hours on DGX Spark (20 ARM cores) |
| Wheel size | 168 MB |
## Verification

```python
import torch

props = torch.cuda.get_device_properties(0)
cc = f"sm_{props.major}{props.minor}"
print(f"Device: {props.name}")                        # NVIDIA GB10
print(f"Compute capability: {cc}")                    # sm_121
print(f"PyTorch archs: {torch.cuda.get_arch_list()}") # ['sm_121']
print(f"Match: {cc in torch.cuda.get_arch_list()}")   # True
```
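The match check can be turned into a fail-fast guard at the top of your own scripts, so a mismatched build raises immediately instead of hanging later in PTX JIT. The helper below is a hypothetical convenience, not part of the wheel:

```python
def arch_supported(major: int, minor: int, arch_list: list[str]) -> bool:
    """True if this build ships native kernels for compute capability major.minor."""
    return f"sm_{major}{minor}" in arch_list

# With this wheel on a GB10 (sm_121) the guard passes:
assert arch_supported(12, 1, ["sm_121"])

# A stock cu130 nightly tops out at sm_120, so the same check fails:
assert not arch_supported(12, 1, ["sm_90", "sm_100", "sm_120"])
```

In practice you would call it as `arch_supported(props.major, props.minor, torch.cuda.get_arch_list())` and raise a `RuntimeError` with an informative message on failure.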
## Benchmark: Boltz-2 HP35 Prediction
| Metric | Pre-built torch (hangs) | NIM container | This wheel |
|---|---|---|---|
| Time | ∞ (hangs) | 4.1s | 7s |
| GPU utilization | 0% | 76% | 81% |
| Peak power | 9W (idle) | 21W | 38.7W |
| Confidence | N/A | 0.93 | 0.93 |
## Limitations

- Built for sm_121 only – won't work on other GPU architectures
- No NCCL / distributed support – single-GPU only (appropriate for DGX Spark)
- Python 3.12 only – rebuild for other Python versions
- Flash-attention Hopper (sm_90) instantiations are compiled but unused
## How to Rebuild

```bash
git clone --recursive --depth 1 https://github.com/pytorch/pytorch
cd pytorch
python -m venv .venv && source .venv/bin/activate
pip install pyyaml typing-extensions numpy cmake ninja setuptools wheel

TORCH_CUDA_ARCH_LIST="12.1" \
USE_CUDA=1 USE_CUDNN=1 USE_NCCL=0 USE_DISTRIBUTED=0 USE_MPI=0 \
MAX_JOBS=$(nproc) BUILD_TEST=0 \
python setup.py bdist_wheel
```
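`TORCH_CUDA_ARCH_LIST` expects the compute capability written as `major.minor` (so sm_121 becomes `12.1`). A small helper sketch for deriving the entry from an `sm_XYZ` tag; the handling of suffixed variants like `sm_121a` (mapped to `12.1a`, analogous to PyTorch's accepted `9.0a`) is an assumption for illustration:

```python
def arch_list_entry(sm_tag: str) -> str:
    """Map an sm_XYZ[a] tag to a TORCH_CUDA_ARCH_LIST entry, e.g. sm_121 -> '12.1'."""
    body = sm_tag.removeprefix("sm_")
    suffix = ""
    if body and body[-1].isalpha():      # architecture-specific variant, e.g. sm_121a
        body, suffix = body[:-1], body[-1]
    # The last digit is the minor version; everything before it is the major.
    return f"{body[:-1]}.{body[-1]}{suffix}"

print(arch_list_entry("sm_121"))   # 12.1
print(arch_list_entry("sm_90"))    # 9.0
```

Building for a single architecture, as above, is what keeps the wheel at 168 MB and the build under ~2 hours; listing multiple entries multiplies kernel compilation time accordingly.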
## See Also

- karpathy/autoresearch#30 – SDPA fallback approach for sm_121
- Boltz-2 NIM – pre-built container that works on GB10