Instructions to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with libraries and local apps.

Libraries

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="QuantFactory/Ministral-8B-Instruct-2410-GGUF",
	filename="Ministral-8B-Instruct-2410.Q2_K.gguf",
)

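# messages must be a list of role/content dicts, not a plain string
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Do we need to think for 10 seconds to find the answer of 1 + 1?"}
	]
)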
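
The example above pins the Q2_K file, while the llama.cpp commands further down use the Q4_K_M tag. To my understanding, `Llama.from_pretrained` resolves `filename` with shell-style glob matching against the repository's file list, so a pattern such as `*Q4_K_M.gguf` can select a different quantization. The sketch below mimics that matching with the standard `fnmatch` module. The file list is hypothetical (only the Q2_K name appears in this card), and `pick_gguf` is an illustrative helper, not part of llama-cpp-python.

```python
from fnmatch import fnmatch

# Hypothetical repo listing; only the Q2_K name is taken from the example above.
REPO_FILES = [
    "README.md",
    "Ministral-8B-Instruct-2410.Q2_K.gguf",
    "Ministral-8B-Instruct-2410.Q4_K_M.gguf",
    "Ministral-8B-Instruct-2410.Q8_0.gguf",
]


def pick_gguf(files: list[str], pattern: str) -> str:
    """Return the single file matching a glob pattern, or raise if ambiguous."""
    matches = [name for name in files if fnmatch(name, pattern)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {pattern!r}, got {matches}")
    return matches[0]


selected = pick_gguf(REPO_FILES, "*Q4_K_M.gguf")
print(selected)
```

The selected name, or the pattern itself, could then be passed as `filename=` to `Llama.from_pretrained`.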

Local Apps

llama.cpp

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Use Docker

docker model run hf.co/QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Ollama
How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with Ollama:
```
ollama run hf.co/QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M
```

Unsloth Studio

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for QuantFactory/Ministral-8B-Instruct-2410-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for QuantFactory/Ministral-8B-Instruct-2410-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for QuantFactory/Ministral-8B-Instruct-2410-GGUF to start chatting

Pi

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
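
If `~/.pi/agent/models.json` already lists other providers, pasting the block above by hand risks overwriting them. The following is a minimal sketch of merging the `llama-cpp` provider into an existing config. It assumes the file is plain JSON with a top-level `providers` object, as in the snippet above; `merge_provider` and `update_models_file` are hypothetical helpers, not part of Pi.

```python
import json
from pathlib import Path

MODEL_ID = "QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M"

# Provider block copied from the configuration shown above.
LLAMA_CPP_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": MODEL_ID}],
}


def merge_provider(config: dict, name: str, provider: dict) -> dict:
    """Return a copy of config with providers[name] set, keeping other providers."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers[name] = provider
    merged["providers"] = providers
    return merged


def update_models_file(path: Path) -> None:
    """Read the config if it exists, merge the provider, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_provider(existing, "llama-cpp", LLAMA_CPP_PROVIDER)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2))


# Example: update_models_file(Path.home() / ".pi" / "agent" / "models.json")
```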

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with Docker Model Runner:
```
docker model run hf.co/QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M
```

Lemonade

How to use QuantFactory/Ministral-8B-Instruct-2410-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull QuantFactory/Ministral-8B-Instruct-2410-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Ministral-8B-Instruct-2410-GGUF-Q4_K_M

List all available models

lemonade list

QuantFactory/Ministral-8B-Instruct-2410-GGUF

This is a quantized version of mistralai/Ministral-8B-Instruct-2410, created using llama.cpp.

Original Model Card

Model Card for Ministral-8B-Instruct-2410

We introduce two new state-of-the-art models for local intelligence, on-device computing, and at-the-edge use cases. We call them les Ministraux: Ministral 3B and Ministral 8B.

The Ministral-8B-Instruct-2410 Language Model is an instruct fine-tuned model significantly outperforming existing models of similar size, released under the Mistral Research License.

If you are interested in using Ministral-3B or Ministral-8B commercially, both of which outperform Mistral-7B, reach out to us.

For more details about les Ministraux please refer to our release blog post.

Ministral 8B Key features

Released under the Mistral Research License; reach out to us for a commercial license
Trained with a 128k context window with interleaved sliding-window attention
Trained on a large proportion of multilingual and code data
Supports function calling
Vocabulary size of 131k, using the V3-Tekken tokenizer

Basic Instruct Template (V3-Tekken)

<s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]

For more information about the tokenizer, please refer to mistral-common.
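
To make the turn layout concrete, here is a small string-level sketch of the basic instruct template. It is only an illustration: `render_v3_tekken` is a hypothetical helper, and it treats the control tokens as plain text, whereas the real tokenizer maps them to dedicated token ids. For actual inference, encoding should go through mistral-common.

```python
BOS, EOS = "<s>", "</s>"


def render_v3_tekken(messages: list[dict]) -> str:
    """Render alternating user/assistant turns in the basic instruct layout."""
    parts = [BOS]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append(f"[INST]{content}[/INST]")
        elif role == "assistant":
            parts.append(f"{content}{EOS}")
        else:
            raise ValueError(f"unsupported role: {role}")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "user message"},
    {"role": "assistant", "content": "assistant response"},
    {"role": "user", "content": "new user message"},
]
print(render_v3_tekken(conversation))
# <s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]
```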

Ministral 8B Architecture

Feature	Value
Architecture	Dense Transformer
Parameters	8,019,808,256
Layers	36
Heads	32
Dim	4096
KV Heads (GQA)	8
Hidden Dim	12288
Head Dim	128
Vocab Size	131,072
Context Length	128k
Attention Pattern	Ragged (128k,32k,32k,32k)
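
The parameter count in the table can be cross-checked against the other rows. The sketch below assumes a Llama-style layout that the card does not spell out: no bias terms, a gated MLP with three weight matrices, two RMSNorm weight vectors per layer plus a final one, and untied input and output embeddings. Under those assumptions the arithmetic reproduces the listed figure.

```python
# Values copied from the architecture table above.
LAYERS = 36
DIM = 4096
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
HIDDEN_DIM = 12288
VOCAB = 131_072


def count_parameters() -> int:
    """Parameter count under an assumed Llama-style layout (no biases)."""
    q_proj = DIM * HEADS * HEAD_DIM
    k_proj = DIM * KV_HEADS * HEAD_DIM   # grouped-query attention: fewer KV heads
    v_proj = DIM * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * DIM
    attention = q_proj + k_proj + v_proj + o_proj
    mlp = 3 * DIM * HIDDEN_DIM           # gate, up, and down projections
    norms = 2 * DIM                      # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * VOCAB * DIM         # untied input and output embeddings
    final_norm = DIM
    return LAYERS * per_layer + embeddings + final_norm


print(f"{count_parameters():,}")  # 8,019,808,256
```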

Benchmarks

Base Models

Knowledge & Commonsense

Model	MMLU	AGIEval	Winogrande	Arc-c	TriviaQA
Mistral 7B Base	62.5	42.5	74.2	67.9	62.5
Llama 3.1 8B Base	64.7	44.4	74.6	46.0	60.2
Ministral 8B Base	65.0	48.3	75.3	71.9	65.5

Gemma 2 2B Base	52.4	33.8	68.7	42.6	47.8
Llama 3.2 3B Base	56.2	37.4	59.6	43.1	50.7
Ministral 3B Base	60.9	42.1	72.7	64.2	56.7

Code & Math

Model	HumanEval pass@1	GSM8K maj@8
Mistral 7B Base	26.8	32.0
Llama 3.1 8B Base	37.8	42.2
Ministral 8B Base	34.8	64.5

Gemma 2 2B Base	20.1	35.5
Llama 3.2 3B Base	14.6	33.5
Ministral 3B Base	34.2	50.9

Multilingual

Model	French MMLU	German MMLU	Spanish MMLU
Mistral 7B Base	50.6	49.6	51.4
Llama 3.1 8B Base	50.8	52.8	54.6
Ministral 8B Base	57.5	57.4	59.6

Gemma 2 2B Base	41.0	40.1	41.7
Llama 3.2 3B Base	42.3	42.2	43.1
Ministral 3B Base	49.1	48.3	49.5

Instruct Models

Chat/Arena (gpt-4o judge)

Model	MTBench	Arena Hard	WildBench
Mistral 7B Instruct v0.3	6.7	44.3	33.1
Llama 3.1 8B Instruct	7.5	62.4	37.0
Gemma 2 9B Instruct	7.6	68.7	43.8
Ministral 8B Instruct	8.3	70.9	41.3

Gemma 2 2B Instruct	7.5	51.7	32.5
Llama 3.2 3B Instruct	7.2	46.0	27.2
Ministral 3B Instruct	8.1	64.3	36.3

Code & Math

Model	MBPP pass@1	HumanEval pass@1	Math maj@1
Mistral 7B Instruct v0.3	50.2	38.4	13.2
Gemma 2 9B Instruct	68.5	67.7	47.4
Llama 3.1 8B Instruct	69.7	67.1	49.3
Ministral 8B Instruct	70.0	76.8	54.5

Gemma 2 2B Instruct	54.5	42.7	22.8
Llama 3.2 3B Instruct	64.6	61.0	38.4
Ministral 3B Instruct	67.7	77.4	51.7

Function calling

Model	Internal bench
Mistral 7B Instruct v0.3	6.9
Llama 3.1 8B Instruct	N/A
Gemma 2 9B Instruct	N/A
Ministral 8B Instruct	31.6

Gemma 2 2B Instruct	N/A
Llama 3.2 3B Instruct	N/A
Ministral 3B Instruct	28.4

Usage Examples

vLLM (recommended)

We recommend using this model with the vLLM library to implement production-ready inference pipelines.

Currently, vLLM is capped at a 32k context size because interleaved attention kernels for paged attention are not yet implemented. These kernels are being worked on, and this model card will be updated as soon as they are fully supported in vLLM. To take advantage of the full 128k context size, we recommend Mistral Inference.

Installation

Make sure you install vLLM >= v0.6.2:

pip install --upgrade vllm

Also make sure you have mistral_common >= 1.4.4 installed:

pip install --upgrade mistral_common

You can also make use of a ready-to-go docker image.

Offline

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Ministral-8B-Instruct-2410"

sampling_params = SamplingParams(max_tokens=8192)

# Note that running Ministral 8B on a single GPU requires 24 GB of GPU RAM.
# To divide the GPU requirement over multiple devices, add e.g. `tensor_parallel_size=2`.
llm = LLM(model=model_name, tokenizer_mode="mistral", config_format="mistral", load_format="mistral")

prompt = "Do we need to think for 10 seconds to find the answer of 1 + 1?"

messages = [
    {
        "role": "user",
        "content": prompt
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
# You don't need to think for 10 seconds to find the answer to 1 + 1. The answer is 2,
# and you can easily add these two numbers in your mind very quickly without any delay.

Server

You can also use Ministral-8B in a server/client setting.

Spin up a server:

vllm serve mistralai/Ministral-8B-Instruct-2410 --tokenizer_mode mistral --config_format mistral --load_format mistral

Note: Running Ministral-8B on a single GPU requires 24 GB of GPU RAM.

If you want to divide the GPU requirement over multiple devices, add e.g. --tensor-parallel-size 2

Then query the server from a client:

curl --location 'http://<your-node-url>:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer token' \
--data '{
    "model": "mistralai/Ministral-8B-Instruct-2410",
    "messages": [
      {
        "role": "user",
        "content": "Do we need to think for 10 seconds to find the answer of 1 + 1?"
      }
    ]
}'
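
The same request can be issued from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `chat` are hypothetical helper names, the base URL is a placeholder for your own node, and the response parsing assumes the usual OpenAI-style chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str, token: str = "token") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first choice's message content."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server above to be running):
# print(chat("http://localhost:8000", "mistralai/Ministral-8B-Instruct-2410",
#            "Do we need to think for 10 seconds to find the answer of 1 + 1?"))
```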

Mistral-inference

We recommend using mistral-inference to quickly try out / "vibe-check" the model.

Install

Make sure to have mistral_inference >= 1.5.0 installed.

pip install mistral_inference --upgrade

Download

from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Ministral-8B-Instruct-2410", allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"], local_dir=mistral_models_path)

Chat

After installing mistral_inference, a mistral-chat CLI command should be available in your environment. You can chat with the model using:

mistral-chat $HOME/mistral_models/8B-Instruct --instruct --max_tokens 256

Passkey detection

In this example, the passkey message has over 100k tokens, and mistral-inference does not have a chunked pre-fill mechanism. You will therefore need a lot of GPU memory (80 GB) to run the example below. For a more memory-efficient solution, we recommend using vLLM.

from mistral_inference.transformer import Transformer
from pathlib import Path
import json
from mistral_inference.generate import generate
from huggingface_hub import hf_hub_download

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

def load_passkey_request() -> ChatCompletionRequest:
    passkey_file = hf_hub_download(repo_id="mistralai/Ministral-8B-Instruct-2410", filename="passkey_example.json")

    with open(passkey_file, "r") as f:
        data = json.load(f)

    message_content = data["messages"][0]["content"]
    return ChatCompletionRequest(messages=[UserMessage(content=message_content)])

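# Same directory the weights were downloaded to in the Download step
mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path, softmax_fp32=False)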

completion_request = load_passkey_request()

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)  # The pass key is 13005.
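
For a rough sense of how memory grows with prompt length, the key/value cache alone can be estimated from the architecture table earlier in this card. The sketch below is an upper bound under stated assumptions: 2 bytes per cached value (fp16 or bf16) and full attention in every layer, ignoring the sliding-window layers. It counts only the cache, not weights or prefill activations, so it does not account for the whole memory requirement quoted above.

```python
# Values from the architecture table earlier in this card.
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Upper-bound KV-cache size: keys and values for every layer and token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return num_tokens * per_token


for tokens in (32_768, 100_000, 131_072):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```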

Instruct following

from pathlib import Path

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Same directory the weights were downloaded to in the Download step
mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(messages=[UserMessage(content="How often does the letter r occur in Mistral?")])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)

Function calling

from pathlib import Path

from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy

# Same directory the weights were downloaded to in the Download step
mistral_models_path = Path.home().joinpath('mistral_models', '8B-Instruct')

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
# Drop special tokens when decoding so the printed result is plain text
tekken = tokenizer.instruct_tokenizer.tokenizer
tekken.special_token_policy = SpecialTokenPolicy.IGNORE

model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
    ],
)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)
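
The example above only prints the decoded text. One way to act on it is sketched below, assuming the decoded result is a JSON list of objects with `name` and `arguments` keys. That shape is an assumption about the output format, the `sample` string is hypothetical rather than real model output, and `get_current_weather` is a stand-in implementation.

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Stand-in implementation; a real one would call a weather service."""
    return f"Weather for {location} in {format}: (not implemented)"


# Map tool names declared in the request to local callables.
TOOLS = {"get_current_weather": get_current_weather}


def dispatch_tool_calls(result: str) -> list[str]:
    """Parse decoded model output as a JSON list of tool calls and run each."""
    calls = json.loads(result)
    outputs = []
    for call in calls:
        function = TOOLS[call["name"]]
        outputs.append(function(**call["arguments"]))
    return outputs


# Hypothetical decoded output, shaped like the tool schema above.
sample = '[{"name": "get_current_weather", "arguments": {"location": "Paris", "format": "celsius"}}]'
print(dispatch_tool_calls(sample))
```

In practice the parsing step should be wrapped in error handling, since the model may answer in plain text instead of emitting a tool call.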

The Mistral AI Team

Albert Jiang, Alexandre Abou Chahine, Alexandre Sablayrolles, Alexis Tacnet, Alodie Boissonnet, Alok Kothari, Amélie Héliou, Andy Lo, Anna Peronnin, Antoine Meunier, Antoine Roux, Antonin Faure, Aritra Paul, Arthur Darcet, Arthur Mensch, Audrey Herblin-Stoop, Augustin Garreau, Austin Birky, Avinash Sooriyarachchi, Baptiste Rozière, Barry Conklin, Bastien Bouillon, Blanche Savary de Beauregard, Carole Rambaud, Caroline Feldman, Charles de Freminville, Charline Mauro, Chih-Kuan Yeh, Chris Bamford, Clement Auguy, Corentin Heintz, Cyriaque Dubois, Devendra Singh Chaplot, Diego Las Casas, Diogo Costa, Eléonore Arcelin, Emma Bou Hanna, Etienne Metzger, Fanny Olivier Autran, Francois Lesage, Garance Gourdel, Gaspard Blanchet, Gaspard Donada Vidal, Gianna Maria Lengyel, Guillaume Bour, Guillaume Lample, Gustave Denis, Harizo Rajaona, Himanshu Jaju, Ian Mack, Ian Mathew, Jean-Malo Delignon, Jeremy Facchetti, Jessica Chudnovsky, Joachim Studnia, Justus Murke, Kartik Khandelwal, Kenneth Chiu, Kevin Riera, Leonard Blier, Leonard Suslian, Leonardo Deschaseaux, Louis Martin, Louis Ternon, Lucile Saulnier, Lélio Renard Lavaud, Sophia Yang, Margaret Jennings, Marie Pellat, Marie Torelli, Marjorie Janiewicz, Mathis Felardos, Maxime Darrin, Michael Hoff, Mickaël Seznec, Misha Jessel Kenyon, Nayef Derwiche, Nicolas Carmont Zaragoza, Nicolas Faurie, Nicolas Moreau, Nicolas Schuhl, Nikhil Raghuraman, Niklas Muhs, Olivier de Garrigues, Patricia Rozé, Patricia Wang, Patrick von Platen, Paul Jacob, Pauline Buche, Pavankumar Reddy Muddireddy, Perry Savas, Pierre Stock, Pravesh Agrawal, Renaud de Peretti, Romain Sauvestre, Romain Sinthe, Roman Soletskyi, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Soham Ghosh, Sylvain Regnier, Szymon Antoniak, Teven Le Scao, Theophile Gervet, Thibault Schueller, Thibaut Lavril, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Wendy Shang, William El Sayed, William Marshall

GGUF

Model size: 8B params
Architecture: llama
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit