# OpenAI Compatible Server

`llama-cpp-python` offers an OpenAI API compatible web server.

This web server can be used to serve local models and easily connect them to existing clients.
## Setup

### Installation

The server can be installed by running the following command:

```bash
pip install llama-cpp-python[server]
```
### Running the server

The server can then be started by running the following command:

```bash
python3 -m llama_cpp.server --model <model_path>
```
### Server options

For a full list of options, run:

```bash
python3 -m llama_cpp.server --help
```

NOTE: All server options are also available as environment variables. For example, `--model` can be set by setting the `MODEL` environment variable.

Check out the server options reference below for more information on the available settings.

CLI arguments and environment variables are available for all of the fields defined in [`ServerSettings`](#llama_cpp.server.settings.ServerSettings) and [`ModelSettings`](#llama_cpp.server.settings.ModelSettings).

Additionally, the server supports configuration via a JSON config file; check out the [configuration section](#configuration-and-multi-model-support) for more information and examples.
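The docs above only confirm the `--model` → `MODEL` mapping explicitly; a reasonable working assumption is that every flag follows the same uppercasing rule (leading dashes stripped, name uppercased). The helper below is purely illustrative, not part of the library:

```python
def flag_to_env(flag: str) -> str:
    """Map a server CLI flag to its presumed environment-variable name.

    Illustrative sketch: the docs state that `--model` maps to `MODEL`;
    this assumes the same rule applies to the other flags.
    """
    return flag.lstrip("-").upper()

print(flag_to_env("--model"))         # MODEL
print(flag_to_env("--n_gpu_layers"))  # N_GPU_LAYERS
```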
## Guides

### Code Completion

`llama-cpp-python` supports code completion via GitHub Copilot.

*NOTE*: Without GPU acceleration this is unlikely to be fast enough to be usable.

You'll first need to download one of the available code completion models in GGUF format:

- [replit-code-v1_5-GGUF](https://huggingface.co/abetlen/replit-code-v1_5-3b-GGUF)
Then you'll need to run the OpenAI compatible web server with a substantially increased context size to accommodate GitHub Copilot requests:

```bash
python3 -m llama_cpp.server --model <model_path> --n_ctx 16192
```
Then just update your settings in `.vscode/settings.json` to point to your code completion server:

```json
{
    // ...
    "github.copilot.advanced": {
        "debug.testOverrideProxyUrl": "http://<host>:<port>",
        "debug.overrideProxyUrl": "http://<host>:<port>"
    }
    // ...
}
```
### Function Calling

`llama-cpp-python` supports structured function calling based on a JSON schema.

Function calling is completely compatible with the OpenAI function calling API and can be used by connecting with the official OpenAI Python client.

You'll first need to download one of the available function calling models in GGUF format:

- [functionary](https://huggingface.co/meetkai)

Then when you run the server you'll need to also specify either the `functionary-v1` or `functionary-v2` chat format.

Note that since functionary requires an HF tokenizer (due to discrepancies between llama.cpp's and HuggingFace's tokenizers, as mentioned [here](https://github.com/abetlen/llama-cpp-python/blob/main?tab=readme-ov-file#function-calling)), you will also need to pass in the path to the tokenizer. The tokenizer files are already included in the respective HF repositories hosting the gguf files.

```bash
python3 -m llama_cpp.server --model <model_path_to_functionary_v2_model> --chat_format functionary-v2 --hf_pretrained_model_name_or_path <model_path_to_functionary_v2_tokenizer>
```

Check out this [example notebook](https://github.com/abetlen/llama-cpp-python/blob/main/examples/notebooks/Functions.ipynb) for a walkthrough of some interesting use cases for function calling.
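Since the server is compatible with the OpenAI function calling API, a request body follows the standard OpenAI `tools` format. The sketch below builds such a payload as a plain dict; the `get_current_weather` function and its parameters are made-up examples, not part of functionary or the server:

```python
# A hypothetical function definition in the OpenAI "tools" format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# The request body you would pass to the OpenAI client's
# chat.completions.create(...) when talking to the local server.
request = {
    "model": "gpt-3.5-turbo",  # matched against model_alias on the server
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```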
### Multimodal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to read information from both text and images.

You'll first need to download one of the available multi-modal models in GGUF format:

- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)
Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the `llava-1-5` chat format:

```bash
python3 -m llama_cpp.server --model <model_path> --clip_model_path <clip_model_path> --chat_format llava-1-5
```

Then you can just use the OpenAI API as normal:

```python
from openai import OpenAI

client = OpenAI(base_url="http://<host>:<port>/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image_url>"
                    },
                },
                {"type": "text", "text": "What does the image say?"},
            ],
        }
    ],
)
print(response)
```
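Instead of a remote URL, the OpenAI vision API also accepts images inlined as base64 data URIs, which is convenient for local files; this sketch assumes the server handles data URIs the same way. The helper name is illustrative:

```python
import base64

def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    """Encode a local image file as a data URI suitable for the
    `image_url` field of a chat completion request."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can then be used directly as the `"url"` value in the `image_url` content part shown above.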
## Configuration and Multi-Model Support

The server supports configuration via a JSON config file that can be passed using the `--config_file` parameter or the `CONFIG_FILE` environment variable.

```bash
python3 -m llama_cpp.server --config_file <config_file>
```

Config files support all of the server and model options supported by the CLI and environment variables; however, instead of only a single model, the config file can specify multiple models.

The server supports routing requests to multiple models based on the `model` parameter in the request, which is matched against the `model_alias` in the config file.

At the moment only a single model is loaded into memory at a time; the server will automatically load and unload models as needed.
```json
{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
```
The config file format is defined by the [`ConfigFileSettings`](#llama_cpp.server.settings.ConfigFileSettings) class.
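Because config files are plain JSON, they can also be generated programmatically. A minimal sketch, using placeholder model paths and only a subset of the fields shown above (see `ConfigFileSettings` for the full set):

```python
import json

# Placeholder paths and aliases for illustration only.
config = {
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/model-a.gguf",
            "model_alias": "gpt-3.5-turbo",
            "n_ctx": 2048,
        },
        {
            "model": "models/model-b.gguf",
            "model_alias": "gpt-4",
            "n_ctx": 4096,
        },
    ],
}

# Write the config file that --config_file (or CONFIG_FILE) would point at.
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```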
## Server Options Reference

::: llama_cpp.server.settings.ConfigFileSettings
    options:
        show_if_no_docstring: true

::: llama_cpp.server.settings.ServerSettings
    options:
        show_if_no_docstring: true

::: llama_cpp.server.settings.ModelSettings
    options:
        show_if_no_docstring: true