# PyTorch 2.12.0a0 – Custom Build for NVIDIA GB10 (sm_121 / CUDA 13.0)

Custom PyTorch wheel with native sm_121 CUDA support for the NVIDIA GB10 GPU (DGX Spark, Blackwell architecture).
## Why?

Pre-built PyTorch wheels from PyPI (both cu128 stable and cu130 nightly) compile CUDA kernels up to sm_120. The GB10 has compute capability sm_121, which is not included. Simple CUDA operations work via PTX forward compatibility, but complex model inference – particularly diffusion sampling in tools like Boltz-2 – hangs indefinitely during PTX JIT compilation.
## Install

```bash
pip install https://huggingface.co/Qanatpharma/pytorch-sm121-gb10/resolve/main/torch-2.12.0a0+gitb071fd7-cp312-cp312-linux_aarch64.whl
```

Or download and install locally:

```bash
pip install torch-2.12.0a0+gitb071fd7-cp312-cp312-linux_aarch64.whl --force-reinstall --no-deps
```
## Required runtime environment

```bash
# Triton's bundled ptxas doesn't support sm_121a – use the system one
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Enable Tensor Core matmul for better performance
export TORCH_FLOAT32_MATMUL_PRECISION=high
```
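If you launch jobs from a Python entry point rather than a shell, the same environment can be configured programmatically, as long as it happens before `torch` is imported. A sketch (plain assignment here overrides any inherited values):

```python
import os

# Set before importing torch so Triton and torch read them at initialization.
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"
os.environ["TORCH_FLOAT32_MATMUL_PRECISION"] = "high"

# ...only now: import torch
print(os.environ["TRITON_PTXAS_PATH"])
```

For the matmul precision specifically, the in-API equivalent after import is `torch.set_float32_matmul_precision("high")`; the ptxas override, however, must be in the environment before Triton initializes.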
### For Boltz-2 specifically

```bash
pip install cuequivariance-torch cuequivariance-ops-torch-cu12 --extra-index-url https://pypi.nvidia.com
```
## Build Details
| Field | Value |
|---|---|
| PyTorch version | 2.12.0a0 (main branch, commit b071fd7) |
| CUDA | 13.0 |
| CUDA arch | sm_121 only |
| Python | 3.12 |
| Platform | linux_aarch64 (ARM64) |
| Build flags | USE_CUDA=1 USE_CUDNN=1 USE_NCCL=0 USE_DISTRIBUTED=0 |
| Build time | ~2 hours on DGX Spark (20 ARM cores) |
| Wheel size | 168 MB |
## Verification

```python
import torch

props = torch.cuda.get_device_properties(0)
cc = f"sm_{props.major}{props.minor}"
print(f"Device: {props.name}")                        # NVIDIA GB10
print(f"Compute capability: {cc}")                    # sm_121
print(f"PyTorch archs: {torch.cuda.get_arch_list()}") # ['sm_121']
print(f"Match: {cc in torch.cuda.get_arch_list()}")   # True
```
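The match check can be turned into a fail-fast guard at the top of your own scripts, so a mismatched build raises immediately instead of hanging later in PTX JIT. The helper below is a hypothetical convenience, not part of the wheel:

```python
def arch_supported(major: int, minor: int, arch_list: list[str]) -> bool:
    """True if this build ships native kernels for compute capability major.minor."""
    return f"sm_{major}{minor}" in arch_list

# With this wheel on a GB10 (sm_121) the guard passes:
assert arch_supported(12, 1, ["sm_121"])

# A stock cu130 nightly tops out at sm_120, so the same check fails:
assert not arch_supported(12, 1, ["sm_90", "sm_100", "sm_120"])
```

In practice you would call it as `arch_supported(props.major, props.minor, torch.cuda.get_arch_list())` and raise a `RuntimeError` with an informative message on failure.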
## Benchmark: Boltz-2 HP35 Prediction
| Metric | Pre-built torch (hangs) | NIM container | This wheel |
|---|---|---|---|
| Time | ∞ (hangs) | 4.1s | 7s |
| GPU utilization | 0% | 76% | 81% |
| Peak power | 9W (idle) | 21W | 38.7W |
| Confidence | N/A | 0.93 | 0.93 |
## Limitations

- Built for sm_121 only – won't work on other GPU architectures
- No NCCL / distributed support – single-GPU only (appropriate for DGX Spark)
- Python 3.12 only – rebuild for other Python versions
- Flash-attention Hopper (sm_90) instantiations are compiled but unused
## How to Rebuild

```bash
git clone --recursive --depth 1 https://github.com/pytorch/pytorch
cd pytorch
python -m venv .venv && source .venv/bin/activate
pip install pyyaml typing-extensions numpy cmake ninja setuptools wheel

TORCH_CUDA_ARCH_LIST="12.1" \
USE_CUDA=1 USE_CUDNN=1 USE_NCCL=0 USE_DISTRIBUTED=0 USE_MPI=0 \
MAX_JOBS=$(nproc) BUILD_TEST=0 \
python setup.py bdist_wheel
```
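`TORCH_CUDA_ARCH_LIST` expects the compute capability written as `major.minor` (so sm_121 becomes `12.1`). A small helper sketch for deriving the entry from an `sm_XYZ` tag; the handling of suffixed variants like `sm_121a` (mapped to `12.1a`, analogous to PyTorch's accepted `9.0a`) is an assumption for illustration:

```python
def arch_list_entry(sm_tag: str) -> str:
    """Map an sm_XYZ[a] tag to a TORCH_CUDA_ARCH_LIST entry, e.g. sm_121 -> '12.1'."""
    body = sm_tag.removeprefix("sm_")
    suffix = ""
    if body and body[-1].isalpha():      # architecture-specific variant, e.g. sm_121a
        body, suffix = body[:-1], body[-1]
    # The last digit is the minor version; everything before it is the major.
    return f"{body[:-1]}.{body[-1]}{suffix}"

print(arch_list_entry("sm_121"))   # 12.1
print(arch_list_entry("sm_90"))    # 9.0
```

Building for a single architecture, as above, is what keeps the wheel at 168 MB and the build under ~2 hours; listing multiple entries multiplies kernel compilation time accordingly.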
## See Also

- karpathy/autoresearch#30 – SDPA fallback approach for sm_121
- Boltz-2 NIM – pre-built container that works on GB10