Buckets:

hf-doc-build
/

doc-dev

Files

xet

hf-doc-build/doc-dev / inference-endpoints /pr_151 /en /engines /llama_cpp.md

rtrm

about 2 months ago

preview code

download

raw

5.59 kB

llama.cpp

llama.cpp is a high-performance inference engine written in C/C++, tailored for running Llama and compatible models in the GGUF format.

Core features:

GGUF Model Support: Native compatibility with the GGUF format and all quantization types that comes with it.
Multi-Platform: Optimized for both CPU and GPU execution, with support for AVX, AVX2, AVX512, and CUDA acceleration.
OpenAI-Compatible API: Provides endpoints for chat, completion, embedding, and more, enabling seamless integration with existing tools and workflows.
Active Community and Ecosystem: Rapid development and a rich ecosystem of tools, extensions, and integrations

When you create an endpoint with a GGUF model, a llama.cpp container is automatically selected using the latest image built from the master branch of the llama.cpp repository. Upon successful deployment, a server with an OpenAI-compatible endpoint becomes available.

llama.cpp supports multiple endpoints like /tokenize, /health, /embedding, and many more. For a comprehensive list of available endpoints, please refer to the API documentation.

Deployment Steps

To deploy an endpoint with a llama.cpp container, follow these steps:

Create a new endpoint and select a repository containing a GGUF model. The llama.cpp container will be automatically selected.

Choose the desired GGUF file, noting that memory requirements will vary depending on the selected file. For example, an F16 model requires more memory than a Q4_K_M model.

Select your desired hardware configuration.

Optionally, you can customize the container's configuration settings like Max Tokens, Number of Concurrent Requests. For more information on those, please refer to the Configurations section below.
Click the Create Endpoint button to complete the deployment.

Alternatively, you can follow the video tutorial below for a step-by-step guide on deploying an endpoint with a llama.cpp container:

Xet Storage Details

Size:: 5.59 kB
Xet hash:: e4afbdeed6b4aedfe6556247439a7a6218b1d185c82813509219bad71883fd15

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.

Buckets:

hf-doc-build
/

doc-dev

llama.cpp

Deployment Steps

Configurations

Basic Configurations

Advanced Configurations

Troubleshooting

Xet Storage Details