---
license: mit
---

# RTX 5090–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

**Status:** ✅ CONFIRMED WORKING — no more “invalid resource handle” errors
**Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`
**License:** MIT (same as upstream `llama-cpp-python`)

**Platform:** Windows 10/11 x64
**Python:** 3.12
**CUDA:** 12.8 (optimized for Blackwell)

---

## 🚀 Performance (Verified on RTX 5090)

- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than an RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)

> Note: numbers vary with quant, context size, and sampling parameters; these are representative.
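To sanity-check throughput on your own setup, here is a minimal timing sketch (the model path is a placeholder, and tokens/sec is computed naively from wall-clock time, so treat the result as a ballpark figure rather than a benchmark):

```python
# Rough tokens/sec check, not a rigorous benchmark.
# Assumes a local GGUF model at the placeholder path below.
import time
from llama_cpp import Llama

llm = Llama(model_path="your_model.gguf", n_gpu_layers=-1, n_ctx=2048, verbose=False)

start = time.perf_counter()
out = llm("Write a short paragraph about GPUs.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```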
---

## 🔧 Why This Works

The wheel forces **cuBLAS** instead of ggml’s custom CUDA kernels. On an RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger `CUDA error: invalid resource handle`.

cuBLAS is stable on the 5090 and avoids those kernel issues.

**Key CMake flags used:**

```
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # Avoid pinned-memory issues with GDDR7
-DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included
```
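To confirm that the installed wheel was actually built with CUDA support rather than silently falling back to a CPU-only build, a hedged check is sketched below; it assumes the low-level `llama_supports_gpu_offload` binding is re-exported by this llama-cpp-python version, so it guards the lookup:

```python
# Quick check that the wheel was built with GPU offload support.
# llama_supports_gpu_offload is a low-level binding; guard the lookup in case
# this version does not re-export it at the package top level.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)

supports = getattr(llama_cpp, "llama_supports_gpu_offload", None)
if supports is not None:
    print("GPU offload supported:", bool(supports()))
else:
    print("llama_supports_gpu_offload not exposed in this build")
```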
---

## 📋 Requirements

- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022
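A quick environment check before installing can save a failed run. The sketch below only shells out to `nvidia-smi` and `nvcc` and checks the interpreter version; it does not validate the exact driver build:

```python
# Minimal pre-install environment check (Windows).
# Only verifies the Python version and that nvidia-smi / nvcc are on PATH.
import subprocess
import sys

def run(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        return None

assert sys.version_info[:2] == (3, 12), "This wheel targets Python 3.12"

driver = run(["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"])
print("GPU / driver:", driver or "nvidia-smi not found, check your NVIDIA driver install")

nvcc = run(["nvcc", "--version"])
print("CUDA toolkit:", "12.8 found" if nvcc and "12.8" in nvcc else "12.8 not detected (needed only for building)")
```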
---

## 🛠️ Installation

1) Download the wheel: `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

2) Install it from the download directory:

   ```
   pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
   ```
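After installing, a short import check (sketch below) confirms the right package was picked up; the expected version string is inferred from the wheel filename:

```python
# Confirm the freshly installed wheel is importable and reports the expected version.
import llama_cpp

print(llama_cpp.__version__)  # expected: 0.3.16 (matches the wheel filename)
```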
---

## ✅ Quick Verification

```python
from llama_cpp import Llama

# Full GPU offload on the 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
```

**What to look for in the verbose log output:**

- CUDA device assignment lines (e.g., `using CUDA:0`)
- VRAM allocations *without* any “invalid resource handle” errors
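To automate that check, the sketch below runs the verification snippet in a subprocess and scans its combined output for the error string; `verify_llama.py` is a hypothetical filename for wherever you saved the snippet above:

```python
# Automated check: run the verification snippet in a subprocess and scan
# its combined stdout/stderr for the error this wheel is meant to avoid.
# "verify_llama.py" is a placeholder for wherever you saved the snippet above.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "verify_llama.py"],
    capture_output=True,
    text=True,
)
log = (result.stdout + result.stderr).lower()

if "invalid resource handle" in log:
    print("FAIL: CUDA 'invalid resource handle' error still present")
elif result.returncode != 0:
    print(f"FAIL: script exited with code {result.returncode}")
else:
    print("OK: model loaded and generated without CUDA handle errors")
```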
---

## 🏗️ Build It Yourself (Advanced)

**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12

```bat
mkdir C:\wheels
cd C:\wheels

set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
```

**Build time:** ~10 minutes on a modern CPU
**Wheel size:** ~231 MB (larger due to cuBLAS inclusion)
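If you prefer a repeatable, scripted build, the same steps can be driven from Python; this sketch simply sets the environment variables above and invokes pip in a subprocess, assuming CUDA 12.8 and the VS 2022 Build Tools are already installed:

```python
# Scripted version of the manual build steps above (Windows).
# Assumes CUDA 12.8 and Visual Studio Build Tools 2022 are installed.
import os
import subprocess
import sys

wheel_dir = r"C:\wheels"
os.makedirs(wheel_dir, exist_ok=True)

env = os.environ.copy()
env["FORCE_CMAKE"] = "1"
env["CMAKE_BUILD_PARALLEL_LEVEL"] = "15"
env["CMAKE_ARGS"] = (
    "-DGGML_CUDA=ON "
    "-DGGML_CUDA_FORCE_CUBLAS=1 "
    "-DGGML_CUDA_NO_PINNED=1 "
    "-DGGML_CUDA_F16=0 "
    "-DCMAKE_CUDA_ARCHITECTURES=all-major"
)

subprocess.run(
    [sys.executable, "-m", "pip", "wheel", "llama-cpp-python",
     "--no-cache-dir", "--wheel-dir", wheel_dir, "--verbose"],
    env=env,
    check=True,
)
```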
---

## 🐛 Troubleshooting

**“Invalid resource handle” errors**
- This wheel is built specifically to avoid them. If you still see them, verify that:
  - CUDA 12.8 is installed
  - the latest NVIDIA drivers are installed
  - no other CUDA applications are interfering

**CPU fallback**
- If the GPU isn’t detected, check `nvidia-smi` and make sure `CUDA_VISIBLE_DEVICES` isn’t set.
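A quick way to diagnose the CPU-fallback case is sketched below; it reports whether `CUDA_VISIBLE_DEVICES` is set for the current process and whether `nvidia-smi` can see the card:

```python
# Diagnose the CPU-fallback case: is the GPU visible to this process at all?
import os
import subprocess

cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
print("CUDA_VISIBLE_DEVICES:", repr(cvd) if cvd is not None else "not set (good)")

try:
    gpus = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout.strip()
    print("GPUs visible to the driver:\n" + gpus)
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi failed, the NVIDIA driver may not be installed correctly")
```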
---

## 🙏 Credits

Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**.
This wheel adds RTX 5090 compatibility by forcing the cuBLAS fallback; it is not an official upstream release.

- For issues with this specific wheel: *open an issue here (this repo/thread).*
- For general `llama-cpp-python` issues: use the official repository.

---

Finally — RTX 5090 owners can use their flagship GPU for local LLM inference without crashes! 🎉