Buckets:
llama.cpp
llama.cpp is a high-performance inference engine written in C/C++, tailored for running Llama and compatible models in the GGUF format.
Core features:
- GGUF Model Support: Native compatibility with the GGUF format and all quantization types that comes with it.
- Multi-Platform: Optimized for both CPU and GPU execution, with support for AVX, AVX2, AVX512, and CUDA acceleration.
- OpenAI-Compatible API: Provides endpoints for chat, completion, embedding, and more, enabling seamless integration with existing tools and workflows.
- Active Community and Ecosystem: Rapid development and a rich ecosystem of tools, extensions, and integrations
When you create an endpoint with a GGUF model,
a llama.cpp container is automatically selected
using the latest image built from the master branch of the llama.cpp repository.
Upon successful deployment, a server with an OpenAI-compatible endpoint becomes available.
llama.cpp supports multiple endpoints like /tokenize, /health, /embedding, and many more. For a comprehensive list of available endpoints, please refer to the API documentation.
Deployment Steps
To deploy an endpoint with a llama.cpp container, follow these steps:
- Create a new endpoint and select a repository containing a GGUF model. The llama.cpp container will be automatically selected.
- Choose the desired GGUF file, noting that memory requirements will vary depending on the selected file. For example, an F16 model requires more memory than a Q4_K_M model.
- Select your desired hardware configuration.
Optionally, you can customize the container's configuration settings like
Max Tokens,Number of Concurrent Requests. For more information on those, please refer to the Configurations section below.Click the Create Endpoint button to complete the deployment.
Alternatively, you can follow the video tutorial below for a step-by-step guide on deploying an endpoint with a llama.cpp container:
Xet Storage Details
- Size:
- 5.59 kB
- Xet hash:
- e4afbdeed6b4aedfe6556247439a7a6218b1d185c82813509219bad71883fd15
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.