---
license: apache-2.0
---
# ONNX Runtime GPU 1.24.0 - CUDA 13.0 Build with Blackwell Support

## Overview

Custom-built **ONNX Runtime GPU 1.24.0** for Windows with full **CUDA 13.0** and **Blackwell architecture (sm_120)** support. This build addresses the `cudaErrorNoKernelImageForDevice` error that occurs on the RTX 5060 Ti and other Blackwell-generation GPUs when using the official PyPI distributions.

## Build Specifications

### Environment
- **OS**: Windows 10/11 x64
- **CUDA Toolkit**: 13.0
- **cuDNN**: 9.13 (CUDA 13.0 compatible)
- **Visual Studio**: 2022 (v17.x) with the "Desktop development with C++" workload
- **Python**: 3.13
- **CMake**: 3.26+

### Supported GPU Architectures
- **sm_89**: Ada Lovelace (RTX 4060, 4070, 4090, etc.)
- **sm_90**: Hopper (H100)
- **sm_120**: Blackwell (RTX 5060 Ti, 5080, 5090)

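The list above can be sanity-checked in plain Python: map a GPU's CUDA compute capability to its `sm_XX` tag and test membership in this build's target list. This is an illustrative sketch (the names `BUILT_ARCHS`, `sm_tag`, and `has_kernel_image` are hypothetical helpers, not part of ONNX Runtime), assuming no PTX fallback is embedded in the wheel:

```python
# Real (non-PTX) kernel images are embedded only for these targets:
BUILT_ARCHS = {"sm_89", "sm_90", "sm_120"}

def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an sm_XX tag, e.g. (12, 0) -> 'sm_120'."""
    return f"sm_{major}{minor}"

def has_kernel_image(major: int, minor: int) -> bool:
    """True if this wheel ships a kernel image for the given compute capability."""
    return sm_tag(major, minor) in BUILT_ARCHS

print(has_kernel_image(12, 0))  # RTX 5060 Ti (Blackwell, 12.0) -> True
print(has_kernel_image(8, 6))   # RTX 3060 (Ampere, 8.6) -> False: such a GPU
                                # would hit cudaErrorNoKernelImageForDevice here
```

Anything outside the three listed capabilities falls back to the same "no kernel image" failure this build fixes for Blackwell.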
### Build Configuration

```cmake
CMAKE_CUDA_ARCHITECTURES=89;90;120
onnxruntime_USE_FLASH_ATTENTION=OFF
CUDA_VERSION=13.0
```

**Note**: Flash Attention is disabled because ONNX Runtime 1.24.0's Flash Attention kernels are sm_80-specific and incompatible with the sm_90/sm_120 architectures.

## Installation

```bash
pip install onnxruntime_gpu-1.24.0-cp313-cp313-win_amd64.whl
```

### Verify Installation

```python
import onnxruntime as ort

print(f"Version: {ort.__version__}")
print(f"Providers: {ort.get_available_providers()}")
# Expected output: ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

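Beyond the version check, the sketch below shows one way to create a session that prefers the CUDA execution provider and falls back to CPU. The model path and the `preferred_providers`/`make_session` helpers are illustrative, not part of the onnxruntime API:

```python
def preferred_providers(available: list[str]) -> list[str]:
    """Order execution providers: CUDA first when present, CPU as fallback."""
    order = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in order if p in available]

def make_session(model_path: str):
    """Create an InferenceSession that prefers the CUDA EP, falling back to CPU."""
    import onnxruntime as ort
    return ort.InferenceSession(
        model_path,
        providers=preferred_providers(ort.get_available_providers()),
    )

# e.g. session = make_session("model.onnx"); outputs = session.run(None, inputs)
print(preferred_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

Passing an explicit `providers` list also silences the warning newer onnxruntime versions emit when no provider list is given.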
## Key Features

- ✅ **Blackwell GPU Support**: Full compatibility with the RTX 5060 Ti, 5080, and 5090
- ✅ **CUDA 13.0 Optimized**: Built with the latest CUDA toolkit for optimal performance
- ✅ **Multi-Architecture**: A single build supports Ada Lovelace, Hopper, and Blackwell
- ✅ **Stable for Inference**: Tested with WD14Tagger and Stable Diffusion pipelines

## Known Limitations

⚠️ **Flash Attention Disabled**: Because ONNX Runtime 1.24.0 ships sm_80-only Flash Attention kernels, Flash Attention is not available in this build. This has minimal impact on most inference workloads (e.g., WD14Tagger, image-generation models).

⚠️ **Windows Only**: This build targets Windows x64 only. Linux users should build from source with a similar configuration.

## Performance

Compared to CPU-only execution:
- **Image tagging (WD14Tagger)**: 10-50x faster
- **Inference latency**: Significantly reduced for GPU-accelerated operations
- **Memory**: Efficiently utilizes the 16 GB of VRAM on the RTX 5060 Ti

## Use Cases

- **ComfyUI**: WD14Tagger nodes
- **Stable Diffusion Forge**: ONNX-based models
- **General ONNX Model Inference**: Any ONNX model requiring CUDA acceleration

## Technical Background

### Why This Build Is Necessary

Official ONNX Runtime GPU distributions on PyPI are typically built against older CUDA versions (11.x/12.x) and do not include sm_120 (Blackwell) architecture support. Running inference on a Blackwell GPU with an official build therefore fails with:

```
cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

This custom build resolves the issue by:
1. Compiling with CUDA 13.0
2. Explicitly targeting sm_89, sm_90, and sm_120
3. Disabling the incompatible Flash Attention kernels

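To check which camp a GPU falls into before installing, recent NVIDIA drivers support `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, which prints one `major.minor` capability per device. A small parser (an illustrative helper, not part of any library) makes the output easy to compare against this build's targets:

```python
def parse_compute_caps(smi_output: str) -> list[tuple[int, int]]:
    """Parse `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` output
    (one 'major.minor' line per GPU) into (major, minor) tuples."""
    caps = []
    for line in smi_output.strip().splitlines():
        major, minor = line.strip().split(".")
        caps.append((int(major), int(minor)))
    return caps

# A machine with a Blackwell card reporting capability 12.0:
print(parse_compute_caps("12.0"))   # [(12, 0)]
# A two-GPU machine (Ada Lovelace + Blackwell):
print(parse_compute_caps("8.9\n12.0"))  # [(8, 9), (12, 0)]
```

Capabilities 8.9, 9.0, and 12.0 match this wheel's sm_89/sm_90/sm_120 targets; anything else will reproduce the error above.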
### Flash Attention Status

ONNX Runtime's Flash Attention implementation currently supports only:
- **sm_80**: Ampere (A100; note that consumer Ampere GPUs such as the RTX 3090 are sm_86)
- The kernels are hardcoded with `*_sm80.cu` file naming

Future ONNX Runtime versions may add sm_90/sm_120 support, but as of 1.24.0 it remains unavailable.

## Build Script

For those who want to replicate this build:

```batch
build.bat ^
  --config Release ^
  --build_shared_lib ^
  --parallel ^
  --use_cuda ^
  --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0" ^
  --cudnn_home "C:\Program Files\NVIDIA\CUDNN\v9.13" ^
  --cuda_version=13.0 ^
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="89;90;120" ^
    CUDNN_INCLUDE_DIR="C:\Program Files\NVIDIA\CUDNN\v9.13\include\13.0" ^
    CUDNN_LIBRARY="C:\Program Files\NVIDIA\CUDNN\v9.13\lib\13.0\x64\cudnn.lib" ^
    onnxruntime_USE_FLASH_ATTENTION=OFF ^
  --build_wheel ^
  --skip_tests
```

## Credits

Built by [@ussoewwin](https://huggingface.co/ussoewwin) for the community facing Blackwell GPU compatibility issues with ONNX Runtime.

## License

Apache 2.0 (same as ONNX Runtime)

## Related Projects

- [Flash-Attention-2 for Windows](https://huggingface.co/ussoewwin/Flash-Attention-2_for_Windows)
- [MediaPipe 0.10.21 Python 3.13](https://huggingface.co/ussoewwin/mediapipe-0.10.21-Python3.13)
- [Nunchaku 1.0.1 torch2.9 cp313](https://huggingface.co/ussoewwin/nunchaku-1.0.1-torch2.9-cp313-cp313-win_amd64)

---

**For issues or questions**: Open an issue on the [community discussion](https://huggingface.co/ussoewwin/onnxruntime-gpu-1.24.0/discussions)