---
license: mit
---
# RTX 5000 Series–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

- **Status:** ✅ CONFIRMED WORKING: no more “invalid resource handle” errors
- **Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`
- **License:** MIT (same as upstream `llama-cpp-python`)
- **Platform:** Windows 10/11 x64
- **Python:** 3.12
- **CUDA:** 12.8 (optimized for Blackwell)
---
## 🚀 Performance (Verified on RTX 5090)
- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)
> Notes: numbers vary with quant, context, and params; these are representative.
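Since these numbers depend on your exact setup, here is a minimal timing sketch for reproducing a rough tokens/sec figure yourself. The model path is a placeholder, and the elapsed time includes prompt processing, so treat the result as approximate:

```python
import time

from llama_cpp import Llama

# Placeholder path: point this at any local GGUF model.
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,  # full GPU offload
    n_ctx=2048,
    verbose=False,
)

start = time.perf_counter()
out = llm("Explain quantization in one paragraph.", max_tokens=128)
elapsed = time.perf_counter() - start

# usage["completion_tokens"] counts only the generated tokens
n_gen = out["usage"]["completion_tokens"]
print(f"{n_gen} tokens in {elapsed:.2f}s -> {n_gen / elapsed:.1f} tok/s")
```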
---
## 🔧 Why This Works
The wheel forces **cuBLAS** instead of ggml’s custom CUDA kernels.
On the RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger `CUDA error: invalid resource handle`. cuBLAS is stable on the 5090 and avoids those kernel issues.
**Key CMake flags used:**

```
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # avoid pinned-memory issues with GDDR7
-DGGML_CUDA_F16=0                     # disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # ensure sm_120 is included
```
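One way to sanity-check that an installed wheel was actually compiled with the CUDA backend (rather than silently falling back to CPU-only) is the low-level `llama_supports_gpu_offload` binding. A minimal sketch, assuming your wheel version exposes it at the top level:

```python
import llama_cpp

# True only if the library was compiled with a GPU backend (CUDA here)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```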
---
## 📋 Requirements
- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022
---
## 🛠️ Installation
1) Download the wheel:
   `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

2) Install it with pip:

```
pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
```
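
After installing, a quick import check confirms the wheel landed in the right environment (no model file needed):

```python
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)  # expect 0.3.16
```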
---
## ✅ Quick Verification
```python
from llama_cpp import Llama

# Full GPU offload on the 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,  # full GPU offload
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
```
**What to look for in stdout:**
- CUDA device assignment lines (e.g., using CUDA:0)
- VRAM allocations *without* any “invalid resource handle” errors
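
Once the basic call works, a streaming variant is handy for watching tokens arrive in real time. A minimal sketch reusing the same placeholder model path:

```python
from llama_cpp import Llama

llm = Llama(model_path="your_model.gguf", n_gpu_layers=-1, n_ctx=2048, verbose=False)

# With stream=True the call returns an iterator of partial completion chunks
for chunk in llm("Write a haiku about GPUs:", max_tokens=48, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```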
---
## 🏗️ Build It Yourself (Advanced)
**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12
```
mkdir C:\wheels
cd C:\wheels
set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major
pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
```
**Build time:** ~10 minutes on a modern CPU
**Wheel size:** ~231 MB (larger due to cuBLAS inclusion)
---
## 🐛 Troubleshooting
**“Invalid resource handle” errors**
- This wheel is built specifically to avoid these. If they still appear, verify:
- CUDA 12.8 is installed
- Latest NVIDIA drivers are installed
- No other CUDA apps are interfering
**CPU fallback**
- If GPU isn’t detected, check `nvidia-smi` and ensure `CUDA_VISIBLE_DEVICES` isn’t set.
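
A quick diagnostic sketch for the CPU-fallback case (standard library only; assumes `nvidia-smi` is on your PATH):

```python
import os
import subprocess

# An empty string or "-1" here hides every GPU from CUDA applications
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# List the GPUs the driver can see
subprocess.run(["nvidia-smi", "-L"], check=False)
```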
---
## 🙏 Credits
Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**.
This wheel provides RTX 5090 compatibility by configuring cuBLAS fallback; it is not an official upstream release.
- For issues with this specific wheel: *open an issue here (this repo/thread).*
- For general `llama-cpp-python` issues: use the official repository.
---
Finally — RTX 5000 series owners can use their flagship GPU for local LLM inference without crashes! 🎉