---
license: mit
---

# RTX 5000 Series–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

**Status:** ✅ CONFIRMED WORKING — no more “invalid resource handle” errors
**Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`
**License:** MIT (same as upstream `llama-cpp-python`)
**Platform:** Windows 10/11 x64
**Python:** 3.12
**CUDA:** 12.8 (optimized for Blackwell)

---

## 🚀 Performance (Verified on RTX 5090)

- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than an RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB of VRAM fully utilized (no kernel crashes)

> Note: numbers vary with quantization, context size, and sampling parameters; these figures are representative.

---

## 🔧 Why This Works

The wheel forces **cuBLAS** instead of ggml’s custom CUDA kernels. On the RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger `CUDA error: invalid resource handle`. cuBLAS is stable on the 5090 and avoids those kernel issues.

**Key CMake flags used:**

```
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # Avoid pinned-memory issues with GDDR7
-DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included
```

---

## 📋 Requirements

- NVIDIA RTX 5090 (or another Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022

---

## 🛠️ Installation

1) Download the wheel: `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`
2) Install it:

```
pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
```

---

## ✅ Quick Verification

```python
from llama_cpp import Llama

# Full GPU offload on the 5090
llm = Llama(
    model_path="your_model.gguf",  # point this at your GGUF file
    n_gpu_layers=-1,               # -1 offloads every layer to the GPU
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
```

**What to look for in stdout:**

- CUDA device assignment lines (e.g., `using CUDA:0`)
- VRAM allocations *without* any “invalid resource handle” errors

A streaming variant of this check appears in the appendix at the end of this README.

---

## 🏗️ Build It Yourself (Advanced)

**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with the C++ workload), Python 3.12

```bat
mkdir C:\wheels
cd C:\wheels
set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major
pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
```

**Build time:** ~10 minutes on a modern CPU
**Wheel size:** ~231 MB (larger because the cuBLAS libraries are bundled)

---

## 🐛 Troubleshooting

**“Invalid resource handle” errors**

This wheel specifically fixes these. If you still see them, verify that:

- CUDA 12.8 is installed
- The latest NVIDIA drivers are installed
- No other CUDA applications are interfering

**CPU fallback**

If the GPU isn’t detected, check `nvidia-smi` and make sure `CUDA_VISIBLE_DEVICES` isn’t set (a quick detection snippet is included in the appendix below).

---

## 🙏 Credits

Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**. This wheel provides RTX 5090 compatibility by configuring a cuBLAS fallback; it is not an official upstream release.

- For issues with this specific wheel: open an issue here (this repo).
- For general `llama-cpp-python` issues: use the official repository.

---

Finally — RTX 5000 series owners can use their flagship GPU for local LLM inference without crashes! 🎉
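
---

## 🔎 Appendix: GPU-Detection Check

The detection snippet referenced in Troubleshooting. This is a minimal sketch: it assumes the low-level `llama_supports_gpu_offload` binding is re-exported at the package top level, as in recent `llama-cpp-python` releases; if your version lays things out differently, import it from `llama_cpp.llama_cpp` instead.

```python
import os

from llama_cpp import llama_supports_gpu_offload  # assumes top-level re-export

# A set (especially empty) CUDA_VISIBLE_DEVICES can hide GPUs from the CUDA runtime.
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if visible is not None:
    print(f"CUDA_VISIBLE_DEVICES is set to {visible!r}; unset it for full GPU access.")

# Returns True only when the installed wheel was built with GPU support.
print("GPU offload supported:", llama_supports_gpu_offload())
```

If this prints `False`, the installed wheel is a CPU-only build; reinstall the CUDA wheel from this repo.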
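
## 💬 Appendix: Streaming Variant of the Quick Check

The same high-level API as the quick verification, sketched with streaming enabled: passing `stream=True` makes the call return an iterator of chunks instead of a single response dict (the model path is a placeholder).

```python
from llama_cpp import Llama

llm = Llama(
    model_path="your_model.gguf",  # placeholder: point this at your GGUF file
    n_gpu_layers=-1,               # full GPU offload, as verified above
    n_ctx=2048,
)

# Each chunk carries one slice of the completion in choices[0]["text"].
for chunk in llm("Hello, how are you?", max_tokens=20, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```

Tokens should appear one by one, with no “invalid resource handle” errors along the way.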