---
license: apache-2.0
---
# ONNX Runtime GPU 1.24.0 - CUDA 13.0 Build with Blackwell Support

## Overview

Custom-built **ONNX Runtime GPU 1.24.0** for Windows with full **CUDA 13.0** and **Blackwell architecture (sm_120)** support. This build addresses the `cudaErrorNoKernelImageForDevice` error that occurs with RTX 5060 Ti and other Blackwell-generation GPUs when using official PyPI distributions.

## Build Specifications

### Environment
- **OS**: Windows 10/11 x64
- **CUDA Toolkit**: 13.0
- **cuDNN**: 9.13 (CUDA 13.0 compatible)
- **Visual Studio**: 2022 (v17.x) with Desktop development with C++
- **Python**: 3.13
- **CMake**: 3.26+

### Supported GPU Architectures
- **sm_89**: Ada Lovelace (RTX 4060, 4070, 4090, etc.)
- **sm_90**: Hopper (H100, H200)
- **sm_120**: Blackwell (RTX 5060 Ti, 5080, 5090)
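
The `sm_XY` targets above map directly to the compute capability the driver reports (e.g. `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` prints `12.0` for Blackwell cards). A minimal, illustrative helper for checking whether a given capability is covered by this build (`sm_name`, `is_supported`, and `SUPPORTED` are names invented here, not part of any library):

```python
def sm_name(compute_capability):
    """Map a (major, minor) compute capability to its sm_XY code name,
    e.g. (12, 0) -> "sm_120", (8, 9) -> "sm_89"."""
    major, minor = compute_capability
    return f"sm_{major}{minor}"

# Architectures this wheel ships native kernels for
SUPPORTED = {"sm_89", "sm_90", "sm_120"}

def is_supported(compute_capability):
    """True if this build contains native kernels for the capability."""
    return sm_name(compute_capability) in SUPPORTED

print(is_supported((12, 0)))  # Blackwell RTX 5060 Ti -> True
print(is_supported((8, 6)))   # Ampere RTX 3080 -> False (not in this build)
```

Note that Ampere (sm_86 and below) is deliberately not in the list; users on those cards should stay on the official wheels.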

### Build Configuration

```cmake
CMAKE_CUDA_ARCHITECTURES=89;90;120
onnxruntime_USE_FLASH_ATTENTION=OFF
CUDA_VERSION=13.0
```

**Note**: Flash Attention is disabled because ONNX Runtime 1.24.0's Flash Attention kernels are sm_80-specific and incompatible with sm_90/sm_120 architectures.

## Installation

```bash
pip install onnxruntime_gpu-1.24.0-cp313-cp313-win_amd64.whl
```

### Verify Installation

```python
import onnxruntime as ort
print(f"Version: {ort.__version__}")
print(f"Providers: {ort.get_available_providers()}")
# Expected output: ['CUDAExecutionProvider', 'CPUExecutionProvider']
```
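
Seeing `CUDAExecutionProvider` in the list only confirms the provider library loaded; kernel availability is only exercised when a session actually runs. A sketch of the usual provider-ordering logic, kept as a pure function so it stands alone (`order_providers` is a name invented here; `model.onnx` in the comment is a placeholder path):

```python
def order_providers(available):
    """Return an execution-provider list that prefers CUDA but always
    keeps CPU as a fallback, in the priority order ONNX Runtime expects
    (highest priority first)."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# With onnxruntime installed, a session would be created like this:
#   import onnxruntime as ort
#   providers = order_providers(ort.get_available_providers())
#   sess = ort.InferenceSession("model.onnx", providers=providers)
# On an unsupported build, cudaErrorNoKernelImageForDevice surfaces
# at session creation or first run, not at import time.

print(order_providers(["TensorrtExecutionProvider",
                       "CUDAExecutionProvider",
                       "CPUExecutionProvider"]))
# -> ['CUDAExecutionProvider', 'CPUExecutionProvider']
```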

## Key Features

- **Blackwell GPU Support**: Full compatibility with RTX 5060 Ti, 5080, 5090
- **CUDA 13.0 Optimized**: Built with the latest CUDA toolkit for optimal performance
- **Multi-Architecture**: Single build supports Ada Lovelace, Hopper, and Blackwell
- **Stable for Inference**: Tested with WD14Tagger and Stable Diffusion pipelines

## Known Limitations

⚠️ **Flash Attention Disabled**: Due to sm_80-only kernel implementation in ONNX Runtime 1.24.0, Flash Attention is not available. This has minimal impact on most inference workloads (e.g., WD14Tagger, image generation models).

⚠️ **Windows Only**: This build is specifically for Windows x64. Linux users should build from source with similar configurations.

## Performance

Compared to CPU-only execution:
- **Image tagging (WD14Tagger)**: 10-50x faster
- **Inference latency**: Significant reduction on GPU-accelerated operations
- **Memory**: Efficiently utilizes 16GB VRAM on RTX 5060 Ti
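
Speedup figures like these are easy to reproduce on your own models with a small timing harness; the sketch below is illustrative (`median_latency_ms` is a helper invented here; plug in any callable that runs `session.run(...)`):

```python
import time

def median_latency_ms(fn, warmup=3, iters=20):
    """Median wall-clock latency of fn() in milliseconds.

    Warmup iterations are discarded so one-time costs (CUDA context
    creation, cuDNN autotuning) do not skew the measurement."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    mid = len(samples) // 2
    if len(samples) % 2:
        return samples[mid]
    return (samples[mid - 1] + samples[mid]) / 2

# Usage sketch (session and feed names are placeholders):
#   cpu_ms  = median_latency_ms(lambda: cpu_sess.run(None, feeds))
#   cuda_ms = median_latency_ms(lambda: cuda_sess.run(None, feeds))
#   print(f"speedup: {cpu_ms / cuda_ms:.1f}x")
```

The median is preferred over the mean here because GPU timings occasionally spike (context switches, clock ramp-up) and the median is robust to those outliers.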

## Use Cases

- **ComfyUI**: WD14Tagger nodes
- **Stable Diffusion Forge**: ONNX-based models
- **General ONNX Model Inference**: Any ONNX model requiring CUDA acceleration

## Technical Background

### Why This Build is Necessary

Official ONNX Runtime GPU distributions (PyPI) are typically built for older CUDA versions (11.x/12.x) and do not include sm_120 (Blackwell) architecture support. When running inference on Blackwell GPUs with official builds, users encounter:

```
cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

This custom build resolves the issue by:
1. Compiling with CUDA 13.0
2. Explicitly targeting sm_89, sm_90, sm_120
3. Disabling incompatible Flash Attention kernels
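
The selection rule behind the error can be modeled simply: the CUDA runtime will only launch a kernel if the binary embeds SASS for the device's exact architecture, or PTX for an equal-or-older architecture that it can JIT-compile forward. A pure-Python model of that check (illustrative only, not the actual driver logic; `kernel_image_available` is a name invented here):

```python
def kernel_image_available(embedded, device_sm):
    """Model of CUDA fatbin selection.

    embedded  -- entries like "sm_89" (native SASS) or "compute_89" (PTX)
    device_sm -- the device's architecture number, e.g. 120 for Blackwell

    SASS must match the device exactly; PTX can be JIT-compiled forward
    to newer architectures. When neither applies, the runtime reports
    cudaErrorNoKernelImageForDevice."""
    for entry in embedded:
        kind, _, arch = entry.partition("_")
        arch = int(arch)
        if kind == "sm" and arch == device_sm:
            return True   # exact native binary
        if kind == "compute" and arch <= device_sm:
            return True   # PTX, JIT-able on newer GPUs
    return False

# Official wheel without sm_120 SASS or usable PTX, on an RTX 5060 Ti:
print(kernel_image_available(["sm_86", "sm_89"], 120))            # False
# This build, which targets sm_120 natively:
print(kernel_image_available(["sm_89", "sm_90", "sm_120"], 120))  # True
```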

### Flash Attention Status

ONNX Runtime's Flash Attention implementation currently only supports:
- **sm_80**: Ampere (A100; sm_80 binaries also run on sm_86 cards such as the RTX 3090)
- Kernels are hardcoded with `*_sm80.cu` file naming

Future ONNX Runtime versions may add sm_90/sm_120 support, but as of 1.24.0, this remains unavailable.

## Build Script

For those who want to replicate this build:

```batch
build.bat ^
  --config Release ^
  --build_shared_lib ^
  --parallel ^
  --use_cuda ^
  --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0" ^
  --cudnn_home "C:\Program Files\NVIDIA\CUDNN\v9.13" ^
  --cuda_version=13.0 ^
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="89;90;120" ^
                         CUDNN_INCLUDE_DIR="C:\Program Files\NVIDIA\CUDNN\v9.13\include\13.0" ^
                         CUDNN_LIBRARY="C:\Program Files\NVIDIA\CUDNN\v9.13\lib\13.0\x64\cudnn.lib" ^
                         onnxruntime_USE_FLASH_ATTENTION=OFF ^
  --build_wheel ^
  --skip_tests
```

## Credits

Built by [@ussoewwin](https://huggingface.co/ussoewwin) for the community facing Blackwell GPU compatibility issues with ONNX Runtime.

## License

Apache 2.0 (same as ONNX Runtime)