docker pull ghcr.io/ggerganov/llama.cpp:server

Assuming the mistral-7B-instruct-v0.2-q8.gguf file has been downloaded to the /path/to/models directory on the local machine, run the container and serve the model with:

docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512
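
Once the container is running, the deployment can also be checked from the command line. A minimal sketch (the /health and /completion endpoints are part of the llama.cpp server API; the prompt and n_predict values here are illustrative):

# Check that the server is up and the model is loaded
curl http://localhost:8000/health

# Request a completion from the native llama.cpp endpoint
curl http://localhost:8000/completion -H "Content-Type: application/json" -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'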
  • Test the deployment by opening http://localhost:8000 in a browser
  • The llama.cpp server also provides an OpenAI-compatible API (see the example request after the CUDA commands below)
  • Deployment on a CUDA GPU:
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda
docker run --gpus all -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server-cuda -m /models/mistral-7B-instruct-v0.2-q8.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 50
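
As noted above, the server exposes an OpenAI-compatible API alongside its native endpoints. A minimal sketch of a chat request against it (the /v1/chat/completions route follows the OpenAI schema; the model field is illustrative and is typically ignored, since the server serves whatever model was passed with -m):

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "mistral-7B-instruct-v0.2-q8", "messages": [{"role": "user", "content": "Write a haiku about Docker."}], "max_tokens": 128}'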