---
license: mit
---

# RTX 5000 Series–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

**Status:** ✅ CONFIRMED WORKING — No more “invalid resource handle” errors

**Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

**License:** MIT (same as upstream `llama-cpp-python`)

**Platform:** Windows 10/11 x64

**Python:** 3.12

**CUDA:** 12.8 (optimized for Blackwell)

---

## 🚀 Performance (Verified on RTX 5090)

- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than an RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)

> Note: numbers vary with quantization, context size, and sampling parameters; these are representative.
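
If you want to reproduce a rough throughput number on your own hardware, a minimal timing sketch like the one below works; the model path is a placeholder, and the token count comes from the OpenAI-style `usage` field that `llama-cpp-python` returns.

```python
import time
from llama_cpp import Llama

# Placeholder model path -- substitute your own GGUF file.
llm = Llama(model_path="your_model.gguf", n_gpu_layers=-1, n_ctx=2048, verbose=False)

prompt = "Explain what cuBLAS does in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s")
```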

---

## 🔧 Why This Works

The wheel forces **cuBLAS** instead of ggml’s custom CUDA kernels. On the RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger “CUDA error: invalid resource handle”. cuBLAS is stable on the 5090 and avoids those kernel issues.

**Key CMake flags used:**

```cmake
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # Avoid pinned-memory issues with GDDR7
-DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included
```
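
To double-check that the driver actually reports the Blackwell compute capability (12.0, i.e. `sm_120`) for your card, you can query `nvidia-smi` from Python; this is a small sketch and assumes a recent driver that supports the `compute_cap` query field.

```python
import subprocess

# Ask the NVIDIA driver for the GPU's name and compute capability.
cap = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(cap)  # Expect something like "NVIDIA GeForce RTX 5090, 12.0"
```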

---

## 📋 Requirements

- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022
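
Most of these can be sanity-checked from one short script; the sketch below assumes `nvidia-smi` and `nvcc` are on your `PATH`.

```python
import subprocess
import sys

# Interpreter version: this wheel targets CPython 3.12 on 64-bit Windows.
print("Python:", sys.version)

# GPU model and driver version as reported by the NVIDIA driver.
gpu = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("GPU/driver:", gpu)

# CUDA Toolkit version: look for the "release" line (should say 12.8).
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print(next(line for line in nvcc.splitlines() if "release" in line))
```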

---

## 🛠️ Installation

1) Download the wheel: `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

2) Install:

```bash
pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
```
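
A quick import check confirms the wheel landed in the environment you expect; the version string should match the wheel above.

```python
import llama_cpp

# Should print 0.3.16 for this wheel.
print(llama_cpp.__version__)
```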

---

## ✅ Quick Verification

```python
from llama_cpp import Llama

# Full GPU offload on the 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
```

**What to look for in stdout:**

- CUDA device assignment lines (e.g., `using CUDA:0`)
- VRAM allocations *without* any “invalid resource handle” errors
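
If you prefer a programmatic check over reading the verbose log, the low-level bindings expose a helper that reports whether the installed build was compiled with GPU offload support; this is a sketch and assumes the binding is exported as `llama_supports_gpu_offload` (a CPU-only wheel returns `False`).

```python
from llama_cpp import llama_cpp as lib

# True means the native library was built with a GPU backend (CUDA here).
print("GPU offload supported:", bool(lib.llama_supports_gpu_offload()))
```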

---

## 🏗️ Build It Yourself (Advanced)

**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with the C++ workload), Python 3.12
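
Before kicking off a long build, it is worth confirming that the CUDA and MSVC toolchains are reachable from the shell you are using; this sketch assumes you run it from an x64 Native Tools (developer) prompt, since `cl.exe` is normally only on `PATH` there.

```python
import shutil

# Both compilers must resolve for the CUDA build to succeed.
for tool in ("nvcc", "cl"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```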

```bat
mkdir C:\wheels
cd C:\wheels

set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
```

**Build time:** ~10 minutes on a modern CPU

**Wheel size:** ~231 MB (larger due to cuBLAS inclusion)
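
The size above is attributed to cuBLAS; if you want to see what actually got packaged, you can list the archive contents. This is an optional sketch, and an empty result simply means the build links against the system-installed CUDA libraries rather than bundling them.

```python
import glob
import zipfile

# A wheel is just a zip archive; look for CUDA-related binaries inside it.
wheel = glob.glob(r"C:\wheels\llama_cpp_python-*.whl")[0]
with zipfile.ZipFile(wheel) as zf:
    cuda_bits = [n for n in zf.namelist() if "cublas" in n.lower() or "cudart" in n.lower()]

print(f"{wheel}: {len(cuda_bits)} CUDA-related files")
for name in cuda_bits:
    print("  ", name)
```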

---

## 🐛 Troubleshooting

**“Invalid resource handle” errors**

- This wheel specifically fixes this error. If you still see it, verify that:
  - CUDA 12.8 is installed
  - the latest NVIDIA drivers are installed
  - no other CUDA applications are interfering

**CPU fallback**

- If the GPU isn’t detected, check `nvidia-smi` and make sure `CUDA_VISIBLE_DEVICES` isn’t set (see the sketch below).
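
A quick way to rule out an environment-variable culprit is to check it from the same interpreter that runs your inference script; the variable name below is the standard CUDA one.

```python
import os
import subprocess

# If CUDA_VISIBLE_DEVICES is set to an empty string or a wrong index,
# the CUDA runtime hides the GPU and inference silently falls back to CPU.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# Confirm the driver still sees the card at all.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout.strip())
```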

---

## 🙏 Credits

Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**. This wheel provides RTX 5090 compatibility by forcing the cuBLAS path; it is not an official upstream release.

- For issues with this specific wheel: *open an issue here (this repo/thread).*
- For general `llama-cpp-python` issues: use the official repository.

---

Finally — RTX 5000 series owners can use their flagship GPU for local LLM inference without crashes! 🎉