Commit 0824501 (verified) · parent 64691d0 · boneylizardwizard

Update README.md

Files changed (1): README.md (+127, -3)
---
license: mit
---

# RTX 5090–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

**Status:** ✅ CONFIRMED WORKING (no more “invalid resource handle” errors)
**Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`
**License:** MIT (same as upstream `llama-cpp-python`)

**Platform:** Windows 10/11 x64
**Python:** 3.12
**CUDA:** 12.8 (optimized for Blackwell)

---
## 🚀 Performance (Verified on RTX 5090)

- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)

> Notes: numbers vary with quant, context, and params; these are representative. See the measurement sketch below.
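To reproduce a rough tokens/sec figure on your own setup, here is a minimal timing sketch. The model path is a placeholder; the `usage` section is part of the OpenAI-style completion dict that `llama-cpp-python` returns:

```python
import time

from llama_cpp import Llama

# Placeholder path: point this at the GGUF quant you want to benchmark.
llm = Llama(model_path="your_model.gguf", n_gpu_layers=-1, n_ctx=2048, verbose=False)

t0 = time.perf_counter()
out = llm("Write a short story about a robot.", max_tokens=256)
elapsed = time.perf_counter() - t0

# llama-cpp-python returns an OpenAI-style dict with a "usage" section.
n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.2f}s -> {n / elapsed:.1f} tok/s")
```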

---

## 🔧 Why This Works

The wheel forces **cuBLAS** instead of ggml’s custom CUDA kernels.
On RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger
“CUDA error: invalid resource handle”.

cuBLAS is stable on the 5090 and avoids those kernel issues.

**Key CMake flags used:**

```
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # Avoid pinned memory issues with GDDR7
-DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included
```
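To confirm at runtime that the wheel you installed really has GPU offload compiled in, a quick sketch, assuming your `llama-cpp-python` version re-exports the low-level `llama_supports_gpu_offload` binding (recent versions do):

```python
import llama_cpp

# llama_supports_gpu_offload() is a thin binding over the llama.cpp C API;
# it returns False for CPU-only builds of the library.
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```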

---

## 📋 Requirements

- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022
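A quick check of the Python-side items on this list (driver and CUDA Toolkit versions still need `nvidia-smi` / `nvcc --version`):

```python
# Check the Python-side requirements: Python 3.12 on 64-bit Windows.
import platform
import sys

print("Python:", sys.version.split()[0])    # expect 3.12.x
print("OS:", platform.system(), platform.release())
print("Arch:", platform.machine())          # expect AMD64
```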

---

## 🛠️ Installation

1) Download the wheel:
   `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

2) Install:

```
pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
```
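As a quick post-install sanity check (the `__version__` attribute is part of the package’s public surface):

```python
# Post-install sanity check: the import should succeed and report 0.3.16.
import llama_cpp

print(llama_cpp.__version__)
```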

---

## ✅ Quick Verification

```python
from llama_cpp import Llama

# Full GPU offload on the 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,  # full GPU
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
```

**What to look for in the console output:**
- CUDA device assignment lines (e.g., `using CUDA:0`)
- VRAM allocations *without* any “invalid resource handle” errors

---

## 🏗️ Build It Yourself (Advanced)

**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12

```
mkdir C:\wheels
cd C:\wheels

set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
```
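If you would rather script the build (for CI, say), a minimal Python sketch of the same invocation; it assumes `pip` and the CUDA 12.8 toolchain are already on `PATH` and simply sets the same environment variables before shelling out:

```python
# Sketch: run the same wheel build via subprocess, with the build flags
# injected through the environment (FORCE_CMAKE makes pip rebuild from source).
import os
import subprocess

env = dict(
    os.environ,
    FORCE_CMAKE="1",
    CMAKE_BUILD_PARALLEL_LEVEL="15",
    CMAKE_ARGS=(
        "-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 "
        "-DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 "
        "-DCMAKE_CUDA_ARCHITECTURES=all-major"
    ),
)
subprocess.run(
    ["pip", "wheel", "llama-cpp-python", "--no-cache-dir",
     "--wheel-dir", r"C:\wheels", "--verbose"],
    env=env,
    check=True,
)
```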

**Build time:** ~10 minutes on a modern CPU
**Wheel size:** ~231 MB (larger due to cuBLAS inclusion)

---

## 🐛 Troubleshooting

**“Invalid resource handle” errors**
- This wheel specifically fixes these. If you still see them, verify:
  - CUDA 12.8 is installed
  - The latest NVIDIA drivers are installed
  - No other CUDA apps are interfering

**CPU fallback**
- If the GPU isn’t detected, check `nvidia-smi` and ensure `CUDA_VISIBLE_DEVICES` isn’t set (see the sketch below).
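A small diagnostic sketch for the CPU-fallback case (assumes `nvidia-smi` is on `PATH`):

```python
# Diagnose CPU fallback: an unset (or empty) CUDA_VISIBLE_DEVICES and a
# visible RTX 5090 in nvidia-smi are what you want to see.
import os
import subprocess

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
subprocess.run(["nvidia-smi"], check=False)  # should list the RTX 5090
```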

---

## 🙏 Credits

Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**.
This wheel provides RTX 5090 compatibility by configuring the cuBLAS fallback; it is not an official upstream release.

- For issues with this specific wheel: *open an issue here (this repo/thread).*
- For general `llama-cpp-python` issues: use the official repository.

---

Finally: RTX 5090 owners can use their flagship GPU for local LLM inference without crashes! 🎉