# Simple autogenerated Python bindings for ggml

This folder contains:

- Scripts to generate full Python bindings from ggml headers (+ stubs for autocompletion in IDEs)
- Some barebones utils (see [ggml/utils.py](./ggml/utils.py)):
  - `ggml.utils.init` builds a context that's freed automatically when the pointer gets garbage-collected
  - `ggml.utils.copy` **copies between same-shaped tensors (numpy or ggml), with automatic (de/re)quantization**
  - `ggml.utils.numpy` returns a numpy view over a ggml tensor; if the tensor is quantized, it returns a copy instead (requires `allow_copy=True`)
- Very basic examples (does anyone want to port [llama2.c](https://github.com/karpathy/llama2.c)?)

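The view/copy distinction matters because writes through a view land directly in the tensor's memory, while a copy is detached from it. Here is a minimal numpy-only sketch of that behaviour. It assumes nothing about how `ggml.utils.numpy` is implemented internally: a plain `bytearray` stands in (hypothetically) for a ggml tensor's data buffer.

```python
import numpy as np

# Hypothetical stand-in for a tensor's data buffer (4 float32 values, zero-filled).
buffer = bytearray(4 * np.dtype(np.float32).itemsize)

# A view: numpy wraps the existing memory without copying it.
view = np.frombuffer(buffer, dtype=np.float32)
view[:] = [1.0, 2.0, 3.0, 4.0]  # writes land directly in `buffer`

# A copy: independent memory, so edits don't reach `buffer`.
snapshot = view.copy()
snapshot[0] = 100.0

print(np.frombuffer(buffer, dtype=np.float32))  # [1. 2. 3. 4.]
print(np.shares_memory(view, snapshot))  # False
```

This presumably explains why quantized tensors need `allow_copy=True`: their bytes aren't float32 values, so they can't be exposed as a plain float view, and a converted copy has to be returned instead.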
Provided you set `GGML_LIBRARY=.../path/to/libggml_shared.so` (see instructions below), it's trivial to run operations on quantized tensors:

```python
# Make sure libllama.so is in your [DY]LD_LIBRARY_PATH, or set GGML_LIBRARY=.../libggml_shared.so

from ggml import lib, ffi
from ggml.utils import init, copy, numpy
import numpy as np

ctx = init(mem_size=12*1024*1024)
n = 256
n_threads = 4

a = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_Q5_K, n)
b = lib.ggml_new_tensor_1d(ctx, lib.GGML_TYPE_F32, n)  # Can't both be quantized
sum = lib.ggml_add(ctx, a, b)  # All zeroes for now. Will be quantized too!

gf = ffi.new('struct ggml_cgraph*')
lib.ggml_build_forward_expand(gf, sum)

copy(np.arange(n, dtype=np.float32), a)
copy(np.arange(n, dtype=np.float32) * 100, b)

lib.ggml_graph_compute_with_ctx(ctx, gf, n_threads)

print(numpy(a, allow_copy=True))
#  0.    1.0439453   2.0878906   3.131836    4.1757812   5.2197266. ...
print(numpy(b))
#  0.  100.  200.  300.  400.  500. ...
print(numpy(sum, allow_copy=True))
#  0.    105.4375    210.875     316.3125    421.75      527.1875 ...
```
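The printed values of `a` are not exactly 0, 1, 2, ... because `a` is stored as Q5_K: `copy` quantized the inputs and `numpy(..., allow_copy=True)` dequantized them, losing some precision along the way. The sketch below illustrates such a round trip with a deliberately simplified scheme (one min and one scale per 32-value block, 5-bit codes). It is **not** ggml's Q5_K layout, which uses a more elaborate super-block structure, so it won't reproduce the exact numbers above.

```python
import numpy as np

def quantize(x: np.ndarray, block_size: int = 32, bits: int = 5):
    """Simplified per-block quantization: a min, a scale and `bits`-bit codes per block."""
    levels = (1 << bits) - 1  # 31 steps for 5-bit codes
    blocks = x.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / levels
    scales[scales == 0] = 1.0  # constant block: any non-zero scale works
    codes = np.round((blocks - mins) / scales).astype(np.uint8)
    return codes, scales, mins

def dequantize(codes: np.ndarray, scales: np.ndarray, mins: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float32 values from the codes."""
    return (codes.astype(np.float32) * scales + mins).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
codes, scales, mins = quantize(x)
y = dequantize(codes, scales, mins)

# Each value is off by at most half a quantization step of its block.
print(np.abs(y - x).max())
```

In the bindings, this conversion happens for you whenever `ggml.utils.copy` moves data between a float array and a quantized tensor.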
### Prerequisites

You'll need a shared library of ggml to use the bindings.

#### Build libggml_shared.so or libllama.so

As of this writing, the best option is to use [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)'s generated `libggml_shared.so` or `libllama.so`, which you can build as follows:

```bash
git clone https://github.com/ggerganov/llama.cpp
# On a CUDA-enabled system add -DLLAMA_CUBLAS=1
# On a Mac add -DLLAMA_METAL=1
cmake llama.cpp \
  -B llama_build \
  -DCMAKE_C_FLAGS=-Ofast \
  -DLLAMA_NATIVE=1 \
  -DLLAMA_LTO=1 \
  -DBUILD_SHARED_LIBS=1 \
  -DLLAMA_MPI=1 \
  -DLLAMA_BUILD_TESTS=0 \
  -DLLAMA_BUILD_EXAMPLES=0
( cd llama_build && make -j )

# On Mac, this will be libggml_shared.dylib instead
export GGML_LIBRARY=$PWD/llama_build/libggml_shared.so
# Alternatively, you can just copy it to your system's lib dir, e.g. /usr/local/lib
```
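If you drive everything from Python (e.g. a notebook), you can set the variable from code instead of using `export`. This is a minimal sketch. It assumes the bindings read `GGML_LIBRARY` when `ggml` is first imported, so the code must run before `import ggml`. `shared_lib_path` is a hypothetical helper, and it only covers the Linux and macOS file names mentioned above.

```python
import os
import sys
from pathlib import Path

def shared_lib_path(build_dir: str, platform: str = sys.platform) -> Path:
    """Hypothetical helper: expected location of ggml's shared library in a CMake build dir."""
    # macOS builds produce a .dylib; Linux (and most other Unixes) a .so.
    suffix = ".dylib" if platform == "darwin" else ".so"
    return Path(build_dir).resolve() / f"libggml_shared{suffix}"

# Respect a value the user already exported; otherwise point at ./llama_build.
os.environ.setdefault("GGML_LIBRARY", str(shared_lib_path("llama_build")))
print(os.environ["GGML_LIBRARY"])

# import ggml  # must come after GGML_LIBRARY is set (assumption, see above)
```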
#### (Optional) Regenerate the bindings and stubs

If you added or changed any signatures of the C API, you'll want to regenerate the bindings ([ggml/cffi.py](./ggml/cffi.py)) and stubs ([ggml/__init__.pyi](./ggml/__init__.pyi)).

Luckily it's a one-liner using [regenerate.py](./regenerate.py):

```bash
pip install -q cffi
python regenerate.py
```

By default, it assumes `llama.cpp` was cloned in `../../../llama.cpp` (alongside the `ggml` folder). You can override this with:

```bash
C_INCLUDE_DIR=$LLAMA_CPP_DIR python regenerate.py
```

You can also edit [api.h](./api.h) to control which files are included in the generated bindings (defaults to `llama.cpp/ggml*.h`).

In fact, if you only wanted to generate bindings for the current version of the `ggml` repo itself (instead of `llama.cpp`; you'd lose support for k-quants), you could run:

```bash
API=../../include/ggml/ggml.h python regenerate.py
```
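Under the hood, cffi turns plain C declarations (the kind of text `api.h` pulls in) into callable functions and allocatable types. Below is a tiny standalone illustration of that mechanism using cffi's ABI mode against the C standard library rather than ggml. The `struct point` declaration is made up for the example, and we haven't checked whether `regenerate.py` uses ABI or API mode. The example also shows why `ffi.new('struct ggml_cgraph*')` in the first example returns a zero-initialized struct.

```python
from cffi import FFI

ffi = FFI()
# cffi parses ordinary C declarations, as they'd appear in a header.
ffi.cdef("""
    size_t strlen(const char *s);
    struct point { int x; int y; };
""")

# ABI mode: load a shared library at runtime (None = the C standard library, POSIX only).
libc = ffi.dlopen(None)
print(libc.strlen(b"ggml"))  # 4

# ffi.new allocates C memory owned by Python: it is zero-initialized
# and freed once the returned cdata object is garbage-collected.
p = ffi.new("struct point *")
print(p.x, p.y)  # 0 0
p.x = 3
```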
## Develop

Run tests:

```bash
pytest
```
### Alternatives

This example's goal is to showcase [cffi](https://cffi.readthedocs.io/)-generated bindings that are trivial to use and update, but there are already alternatives in the wild:

- https://github.com/abetlen/ggml-python: these bindings seem to be hand-written and use [ctypes](https://docs.python.org/3/library/ctypes.html). They have [high-quality API reference docs](https://ggml-python.readthedocs.io/en/latest/api-reference/#ggml.ggml) that can be used with these bindings too, but they don't expose Metal, CUDA, MPI or OpenCL calls, don't support transparent (de/re)quantization like this example does (see the [ggml.utils](./ggml/utils.py) module), and won't pick up your local changes.
- https://github.com/abetlen/llama-cpp-python: these expose the C++ `llama.cpp` interface, which this example cannot easily be extended to support (`cffi` only generates bindings for C libraries).
- [pybind11](https://github.com/pybind/pybind11) and [nanobind](https://github.com/wjakob/nanobind) are two alternatives to cffi that support binding C++ libraries, but neither of them seems to have an automatic generator (writing bindings by hand is rather time-consuming).
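For comparison, this is what a hand-written ctypes binding involves: each function's argument and return types are declared in Python, and nothing checks them against the C header. The sketch binds libc's `strlen` as a stand-in, not any ggml function.

```python
import ctypes
import ctypes.util

# Load the C standard library; find_library may return None, in which case
# CDLL(None) falls back to the symbols already loaded into the process (POSIX).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# The signature must be spelled out by hand; get it wrong and ctypes won't notice.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"ggml"))  # 4
```

With cffi, the same binding comes from the C declaration `size_t strlen(const char *s);` itself, which is why bindings generated from ggml's headers stay cheap to update.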