---
license: mit
tags:
- llama-cpp-python
- cuda
- gemma
- gemma-3
- windows
- wheel
- prebuilt
- .whl
- local-llm
---
# llama-cpp-python Prebuilt Wheel (Windows x64, CUDA 12.8, Gemma 3 Support)

---

🛠️ **Built with** [llama.cpp (b5192)](https://github.com/ggml-org/llama.cpp) + [CUDA 12.8](https://developer.nvidia.com/cuda-toolkit)

---

**Prebuilt `.whl` for llama-cpp-python 0.3.8 — CUDA 12.8 acceleration with full Gemma 3 model support (Windows x64).**

This repository provides a prebuilt Python wheel (`.whl`) file for **llama-cpp-python**, specifically compiled for Windows 10/11 (x64) with NVIDIA CUDA 12.8 acceleration enabled.

Building `llama-cpp-python` with CUDA support on Windows is a complex process involving specific Visual Studio configurations, CUDA Toolkit setup, and environment variables. This prebuilt wheel simplifies installation for users with compatible systems.

This build is based on the **llama-cpp-python** `0.3.8` Python bindings and the underlying **llama.cpp** source code as of **April 26, 2025**. It has been verified to work with **Gemma 3 models**, correctly offloading layers to the GPU.
---

## Features

- **Prebuilt for Windows x64**: Ready to install using `pip` on 64-bit Windows systems.
- **CUDA 12.8 Accelerated**: Leverages your NVIDIA GPU for faster inference.
- **Gemma 3 Support**: Verified compatibility with Gemma 3 models.
- **Based on llama-cpp-python version `0.3.8` bindings.**
- **Uses [llama.cpp release b5192](https://github.com/ggml-org/llama.cpp/releases/tag/b5192) from April 26, 2025.**
---

## Compatibility & Prerequisites

To use this wheel, you must have:

- An **NVIDIA GPU**.
- NVIDIA drivers compatible with **CUDA 12.8** installed.
- **Windows 10 or Windows 11 (x64)**.
- **Python 3.11** (the wheel is built specifically for CPython 3.11, tag `cp311`; other Python versions cannot install it).
- The **Visual C++ Redistributable for Visual Studio 2015-2022** installed.
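The Python and platform requirements can be read directly off the wheel's filename, which follows the standard `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl` naming convention. A small illustrative sketch (nothing here is specific to this project beyond the filename itself):

```python
# Wheel filenames encode compatibility tags, so you can check a wheel
# against your interpreter before attempting an install.
wheel = "llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl"

dist, version, py_tag, abi_tag, plat_tag = wheel[: -len(".whl")].split("-")
print(py_tag)    # cp311     -> requires CPython 3.11
print(plat_tag)  # win_amd64 -> requires 64-bit Windows
```

If `pip` reports "not a supported wheel on this platform", the interpreter you are running does not match these tags.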
---

## Installation

It is highly recommended to install this wheel within a Python virtual environment.

1. Ensure you have met all the prerequisites listed above.
2. Create and activate a Python virtual environment:

```bash
python -m venv venv_llama
.\venv_llama\Scripts\activate
```

3. Download the `.whl` file from this repository's **Releases** section.
4. Open your Command Prompt or PowerShell.
5. Navigate to the directory where you downloaded the `.whl` file.
6. Install the wheel using `pip`:

```bash
pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
```
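Once `pip` finishes, a quick way to confirm the package is visible to your interpreter is to query its metadata. A minimal sketch using only the standard library (it prints a notice instead of failing if the package is absent):

```python
import importlib.metadata

try:
    # Look up the installed distribution by its pip name.
    version = importlib.metadata.version("llama-cpp-python")
    print(f"llama-cpp-python {version} is installed.")
except importlib.metadata.PackageNotFoundError:
    version = None
    print("llama-cpp-python is not installed in this environment.")
```

Run it inside the activated `venv_llama` environment; it should report version `0.3.8`.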
---

## Verification (Check CUDA Usage)

To verify that `llama-cpp-python` is using your GPU via CUDA after installation:

```bash
python -c "from llama_cpp import Llama; print('Attempting to initialize Llama with GPU offload...'); Llama(model_path='path/to/a/small/model.gguf', n_gpu_layers=-1, verbose=True); print('Initialization attempted. Check output above for GPU layers.')"
```

Note: Replace `path/to/a/small/model.gguf` with the actual path to a small `.gguf` model file. If the model path is wrong the command exits with an error, but the library output printed before the failure may still indicate CUDA usage.

Look for output messages indicating layers being offloaded to the GPU, such as `assigned to device CUDA0`, or memory buffer reports.
## Alternative Verification: Python Script

If you prefer, you can verify that `llama-cpp-python` is correctly using CUDA by running a small Python script inside your virtual environment.

Replace the placeholder paths below with your actual `.dll` and `.gguf` file locations:
```python
import os

# Set the environment variable to point to your custom-built llama.dll.
# This must happen BEFORE importing llama_cpp, because the library path
# is resolved when the module is imported.
os.environ['LLAMA_CPP_LIB'] = r'PATH_TO_YOUR_CUSTOM_LLAMA_DLL'

from llama_cpp import Llama

try:
    print('Attempting to initialize Llama with GPU offload (-1 layers)...')

    # Initialize the Llama model with full GPU offloading
    model = Llama(
        model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf',
        n_gpu_layers=-1,
        verbose=True
    )

    print('Initialization attempted. Check the output above for CUDA device assignments (e.g., CUDA0, CUDA1).')

except FileNotFoundError:
    print('Error: Model file not found. Please double-check your model_path.')
except Exception as e:
    print(f'An error occurred during initialization: {e}')
```
**What to look for in the output:**

- Lines like `assigned to device CUDA0` or `assigned to device CUDA1`.
- VRAM buffer allocations such as `CUDA0 model buffer size = ... MiB`.
- Confirmation that your GPU(s) are being used for model layer offloading.
## Usage

Once installed and verified, you can use `llama-cpp-python` in your projects as you normally would. Refer to the official llama-cpp-python documentation for detailed usage instructions.
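As a starting point, a minimal completion call might look like the sketch below. The model path is a placeholder, and the guard around the import is only there so the snippet degrades gracefully when the wheel is not installed:

```python
try:
    from llama_cpp import Llama
except ImportError:
    Llama = None
    print("llama_cpp is not available; install the wheel first.")

if Llama is not None:
    # Load the model with all layers offloaded to the GPU.
    llm = Llama(model_path=r"PATH_TO_YOUR_MODEL_FILE.gguf", n_gpu_layers=-1, verbose=False)

    # Run a simple completion; the call returns an OpenAI-style response dict.
    result = llm("Q: What is a GGUF file? A:", max_tokens=64, stop=["Q:"])
    print(result["choices"][0]["text"])
```

For chat-style prompting, streaming, and embeddings, see the upstream documentation; the API surface is unchanged by this build.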
## Acknowledgments

This prebuilt wheel is based on the excellent [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) project by Andrei Betlen (@abetlen). All credit for the core library and Python bindings goes to the original maintainers, and to [llama.cpp](https://github.com/ggml-org/llama.cpp) by Georgi Gerganov (@ggerganov).

This specific wheel was built by Bernard Peter Fitzgerald (@boneylizard) using the source code from abetlen/llama-cpp-python, compiled with CUDA 12.8 support for Windows x64 systems, and verified for Gemma 3 model compatibility.
## License

This prebuilt wheel is distributed under the MIT License, the same license as the original llama-cpp-python project.

## Reporting Issues

If you encounter issues specifically with installing this prebuilt wheel or getting CUDA offloading to work using this wheel, please report them on this repository's Issue Tracker.

For general issues with llama-cpp-python itself, please report them upstream at the [official llama-cpp-python GitHub Issues page](https://github.com/abetlen/llama-cpp-python/issues).