---
license: mit
---
# RTX 5000 Series–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

**Status:** ✅ CONFIRMED WORKING — No more “invalid resource handle” errors  
**Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`  
**License:** MIT (same as upstream `llama-cpp-python`)  

**Platform:** Windows 10/11 x64  
**Python:** 3.12  
**CUDA:** 12.8 (optimized for Blackwell)

---

## 🚀 Performance (Verified on RTX 5090)

- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)

> Notes: numbers vary with quant, context, and params; these are representative.

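To reproduce a rough tokens/sec figure on your own hardware, here is a minimal sketch (the model path and prompt are placeholders, not files shipped with this wheel):

    import time
    from llama_cpp import Llama

    # Placeholder path: point this at your local GGUF quant
    llm = Llama(
        model_path="your_model.gguf",
        n_gpu_layers=-1,   # full GPU offload
        n_ctx=2048,
        verbose=False,
    )

    start = time.time()
    out = llm("Write a short story about a robot.", max_tokens=128)
    elapsed = time.time() - start

    # The OpenAI-style response dict reports how many tokens were generated
    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")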
---

## 🔧 Why This Works

The wheel forces **cuBLAS** for matrix multiplication instead of ggml’s custom CUDA kernels.  
On RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger:
“CUDA error: invalid resource handle”.

cuBLAS is stable on the 5090 and avoids those kernel issues.

**Key CMake flags used:**
    -DGGML_CUDA=ON
    -DGGML_CUDA_FORCE_CUBLAS=1      # Use cuBLAS instead of custom kernels
    -DGGML_CUDA_NO_PINNED=1         # Avoid pinned memory issues with GDDR7
    -DGGML_CUDA_F16=0               # Disable problematic FP16 code paths
    -DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included

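As a quick runtime check that the CUDA backend was actually compiled in, recent `llama-cpp-python` releases expose `llama_supports_gpu_offload()` from the underlying `llama.cpp` C API; it should return `True` with this wheel:

    import llama_cpp

    # True when the library was built with a GPU backend (cuBLAS/CUDA here)
    print(llama_cpp.llama_supports_gpu_offload())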
---

## 📋 Requirements

- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022

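To sanity-check these from a command prompt: `nvidia-smi` prints the driver version in its header, and `nvcc --version` should report release 12.8:

    nvidia-smi
    nvcc --version
    python --version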
---

## 🛠️ Installation

1) Download the wheel:  
   `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

2) Install:
    pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl

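3) Optionally, confirm the installed version (this should print `0.3.16`):

    python -c "import llama_cpp; print(llama_cpp.__version__)"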
---

## ✅ Quick Verification

    from llama_cpp import Llama

    # Full GPU offload on 5090
    llm = Llama(
        model_path="your_model.gguf",
        n_gpu_layers=-1,   # full GPU
        n_ctx=2048,
        verbose=True
    )

    out = llm("Hello, how are you?", max_tokens=20)
    print(out["choices"][0]["text"])

**What to look for in stdout:**
- CUDA device assignment lines (e.g., using CUDA:0)
- a line reporting that all model layers were offloaded to the GPU
- VRAM allocations *without* any “invalid resource handle” errors

---

## 🏗️ Build It Yourself (Advanced)

**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12

    mkdir C:\wheels
    cd C:\wheels

    set FORCE_CMAKE=1
    set CMAKE_BUILD_PARALLEL_LEVEL=15
    set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

    pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose

**Build time:** ~10 minutes on a modern CPU  
**Wheel size:** ~231 MB (larger due to cuBLAS inclusion)

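Once the build finishes, install the wheel it produced; `--force-reinstall` makes pip replace any previously installed CPU-only build (the exact filename depends on the version you built):

    pip install C:\wheels\llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl --force-reinstall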
---

## 🐛 Troubleshooting

**“Invalid resource handle” errors**
- This wheel was built specifically to avoid these. If you still see them, verify:
  - CUDA 12.8 is installed
  - Latest NVIDIA drivers are installed
  - No other CUDA apps are interfering

**CPU fallback**
- If GPU isn’t detected, check `nvidia-smi` and ensure `CUDA_VISIBLE_DEVICES` isn’t set.

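From a command prompt, the quickest checks are below; the last command clears `CUDA_VISIBLE_DEVICES` for the current session only:

    nvidia-smi
    echo %CUDA_VISIBLE_DEVICES%
    set CUDA_VISIBLE_DEVICES=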
---

## 🙏 Credits

Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**.  
This wheel provides RTX 5090 compatibility by forcing the cuBLAS fallback at build time; it is not an official upstream release.

- For issues with this specific wheel: *open an issue here (this repo/thread).*  
- For general `llama-cpp-python` issues: use the official repository.

---

Finally — RTX 5000 series owners can use their flagship GPU for local LLM inference without crashes! 🎉