onnxruntime-gpu for aarch64 + CUDA 13 + SM121 (DGX Spark / Jetson Thor)

Pre-built onnxruntime-gpu wheel for NVIDIA DGX Spark (GB10 Blackwell) and Jetson Thor. At the time of writing, it is the first publicly available wheel for this platform.

Why This Exists

As of March 2026, PyPI offers no official onnxruntime-gpu wheel for aarch64 + CUDA 13; only x86_64 builds are provided. This wheel was compiled from source specifically for the NVIDIA GB10 (SM121) Blackwell architecture.

Specifications

Item	Value
Version	1.25.0 (built from the main branch)
Python	3.12 (cp312)
Platform	linux_aarch64
CUDA	13.1 (forward compatible with the 13.0 driver)
cuDNN	9.17.1
CUDA Arch	SM121 (Blackwell)
Providers	CUDAExecutionProvider, CPUExecutionProvider

Installation

pip install https://huggingface.co/Jay0515/onnxruntime-gpu-aarch64-cuda13-sm121/resolve/main/onnxruntime_gpu-1.25.0-cp312-cp312-linux_aarch64.whl
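Because the wheel is tagged cp312 and linux_aarch64, pip rejects it on any other interpreter or architecture. The sketch below checks the host against those tags before installing. The helper names `parse_wheel_tags` and `mismatches` are hypothetical, and the parser only handles the simple filename layout used here, not every case the wheel naming convention allows.

```python
import platform
import sys

WHEEL = "onnxruntime_gpu-1.25.0-cp312-cp312-linux_aarch64.whl"


def parse_wheel_tags(filename):
    """Split a wheel filename into (name, version, python, abi, platform)."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) not in (5, 6):  # 6 parts when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename}")
    py_tag, abi_tag, plat_tag = parts[-3:]
    return parts[0], parts[1], py_tag, abi_tag, plat_tag


def mismatches(filename, py_version=None, machine=None, system=None):
    """Return human-readable reasons the wheel would not fit this host."""
    _, _, py_tag, _, plat_tag = parse_wheel_tags(filename)
    py_version = py_version or sys.version_info[:2]
    machine = machine or platform.machine()
    system = system or platform.system()

    problems = []
    host_py = f"cp{py_version[0]}{py_version[1]}"
    if py_tag != host_py:
        problems.append(f"wheel needs {py_tag}, interpreter is {host_py}")
    host_plat = f"{system.lower()}_{machine}"
    if plat_tag != host_plat:
        problems.append(f"wheel needs {plat_tag}, host is {host_plat}")
    return problems


if __name__ == "__main__":
    issues = mismatches(WHEEL)
    print("OK to install" if not issues else "\n".join(issues))
```

Run this with the same interpreter that pip will use, for example the one inside your virtual environment.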

cuDNN Requirement

cuDNN 9.x must be available on the system. If there is no system-wide installation, you can install it with pip:

pip install nvidia-cudnn-cu13
export LD_LIBRARY_PATH=$(python -c "import nvidia.cudnn; print(nvidia.cudnn.__path__[0])")/lib:$LD_LIBRARY_PATH
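To confirm that the export took effect, you can check from Python whether the pip-installed cuDNN directory is on `LD_LIBRARY_PATH`. This is a minimal sketch that assumes the pip package exposes a `nvidia.cudnn` package with a `lib` subdirectory, as the export line above implies. The function names are hypothetical.

```python
import importlib.util
import os
from pathlib import Path


def find_pip_cudnn_lib_dir():
    """Return the lib directory of a pip-installed cuDNN, or None if absent."""
    try:
        spec = importlib.util.find_spec("nvidia.cudnn")
    except ImportError:  # raised when the parent "nvidia" package is missing
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    lib_dir = Path(list(spec.submodule_search_locations)[0]) / "lib"
    return lib_dir if lib_dir.is_dir() else None


def on_ld_library_path(lib_dir, env=None):
    """Check whether lib_dir is one of the LD_LIBRARY_PATH entries."""
    env = os.environ if env is None else env
    entries = [e for e in env.get("LD_LIBRARY_PATH", "").split(":") if e]
    target = os.path.normpath(str(lib_dir))
    return any(os.path.normpath(e) == target for e in entries)


if __name__ == "__main__":
    lib_dir = find_pip_cudnn_lib_dir()
    if lib_dir is None:
        print("No pip-installed cuDNN found; a system-wide cuDNN may still exist.")
    else:
        print(f"{lib_dir} on LD_LIBRARY_PATH: {on_ld_library_path(lib_dir)}")
```

The dynamic loader reads `LD_LIBRARY_PATH` at process start, so the variable has to be exported in the shell before Python launches.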

Verify

import onnxruntime
print(onnxruntime.get_available_providers())
# Expected: ['CUDAExecutionProvider', 'CPUExecutionProvider']
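Listing the providers only shows what the build supports. To check that a model actually runs on the GPU, you can create a session and inspect the providers it ended up with. The sketch below assumes you already have an ONNX model on disk; `model.onnx` is a placeholder path, and `select_providers` and `make_session` are hypothetical helper names.

```python
PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def select_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that this build offers, in priority order."""
    return [p for p in preferred if p in available]


def make_session(model_path):
    """Create a session and report which providers it actually uses."""
    import onnxruntime as ort  # imported lazily so the helper above works without it

    providers = select_providers(ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    return session, session.get_providers()


if __name__ == "__main__":
    # "model.onnx" is a placeholder; point this at a real model file.
    _, active = make_session("model.onnx")
    print("Active providers:", active)
```

If CUDAExecutionProvider is missing from the active list even though it is available, a shared library such as cuDNN usually failed to load, and ONNX Runtime logs the cause when the session is created.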

Build Environment

Built inside NVIDIA's official PyTorch container (vllm-node, based on nvcr.io/nvidia/pytorch:26.01-py3):

CUDA 13.1, cuDNN 9.17.1, Python 3.12
All CUDA headers and libraries are preinstalled

Build Command

git clone --recursive --depth 1 https://github.com/microsoft/onnxruntime
cd onnxruntime

export CMAKE_BUILD_PARALLEL_LEVEL=4  # limit parallelism to avoid OOM

./build.sh \
  --config Release \
  --build_shared_lib \
  --parallel 4 \
  --use_cuda \
  --cuda_home /usr/local/cuda \
  --cudnn_home /usr \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=121 \
  --build_wheel \
  --skip_tests \
  --allow_running_as_root

Important Notes

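Always use --parallel 4 (do not use nproc). CUDA kernel compilation, flash attention in particular, consumes more than 50 GB of RAM; using all cores on a system with 128 GB of unified memory causes an OOM freeze.
When building inside a Docker container, add --allow_running_as_root.
CMAKE_CUDA_ARCHITECTURES=121 is the key parameter for SM121 Blackwell support.
Build time: about 45 minutes on GB10 with 4 parallel jobs.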
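If you build on a machine with a different amount of memory, you can derive a parallel job count from available RAM instead of reusing the value 4. The following is a rough sketch, not the method used for this build. The per-job peak memory is a hypothetical input that you would need to measure yourself, and the 14 GiB in the usage example is purely illustrative.

```python
import os


def read_mem_available_gib(meminfo_path="/proc/meminfo"):
    """Read MemAvailable from /proc/meminfo (Linux only) and convert to GiB."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kib = int(line.split()[1])  # the value is reported in KiB
                return kib / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in " + meminfo_path)


def max_parallel_jobs(mem_available_gib, per_job_gib, reserve_gib=8.0, cpu_count=None):
    """Largest job count that fits in memory, clamped to [1, cpu_count]."""
    cpu_count = cpu_count or os.cpu_count() or 1
    usable = max(mem_available_gib - reserve_gib, 0.0)
    by_memory = int(usable // per_job_gib)
    return max(1, min(by_memory, cpu_count))


if __name__ == "__main__":
    # 14 GiB per job is an illustrative assumption, not a measured figure.
    jobs = max_parallel_jobs(read_mem_available_gib(), per_job_gib=14.0)
    print(f"--parallel {jobs}")
```

On unified-memory systems such as GB10 the GPU draws from the same pool, so a generous reserve is the safer choice.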

Tested Hardware

ASUS Ascent GX10 (NVIDIA GB10, 128GB LPDDR5x, aarch64)
NVIDIA Driver 580.126.09, CUDA 13.0

Use Case: Fun-ASR-Nano-GGUF + CUDA Encoder

This wheel was built to accelerate the ONNX encoder in CapsWriter-Offline's Fun-ASR-Nano-GGUF pipeline, enabling GPU acceleration on DGX Spark:

Configuration	Time for 20 audio segments	Speedup
PyTorch (FunASR)	40.4s	1.0x
ONNX+GGUF (CPU encoder)	9.23s	4.4x
ONNX+GGUF (CUDA encoder)	5.75s	7.0x
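The speedup column is each configuration's time divided into the PyTorch baseline. The short calculation below reproduces it from the timings in the table and also derives the gain of the CUDA encoder over the CPU encoder, which the table does not list directly.

```python
# Wall-clock times in seconds for the 20 audio segments, copied from the table.
timings = {
    "PyTorch (FunASR)": 40.4,
    "ONNX+GGUF (CPU encoder)": 9.23,
    "ONNX+GGUF (CUDA encoder)": 5.75,
}

baseline = timings["PyTorch (FunASR)"]

# Speedup relative to the PyTorch baseline, rounded to one decimal as in the table.
speedups = {name: round(baseline / t, 1) for name, t in timings.items()}

# Gain from moving only the encoder from CPU to CUDA.
cuda_vs_cpu = round(
    timings["ONNX+GGUF (CPU encoder)"] / timings["ONNX+GGUF (CUDA encoder)"], 2
)

for name, s in speedups.items():
    print(f"{name}: {s}x")
print(f"CUDA encoder vs CPU encoder: {cuda_vs_cpu}x")
```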

License

ONNX Runtime is licensed under the MIT License.