Possibly the provided prompt format is wrong.
Hi!
Thanks for the very quick quants. This model is really great; however, there is apparently a big misunderstanding around the new Mistral prompt format. (It also differs from the official Mistral description.)
Here is my reddit post about it:
https://www.reddit.com/r/LocalLLaMA/comments/1fjb4i5/mistralsmallinstruct2409_is_actually_really/
Marinara also confirmed my theory a few weeks ago. (You can find it in the model description)
https://huggingface.co/MarinaraSpaghetti/NemoMix-Unleashed-12B-GGUF
The correct one should be:
<s>[INST] user message[/INST] assistant message</s>[INST] new user message[/INST]
Another source:
https://community.aws/content/2dFNOnLVQRhyrOrMsloofnW0ckZ/how-to-prompt-mistral-ai-models-and-why
I tested it with your version and ours as well. Nemo and this model are way more coherent and "clever" with the suggested format.
With yours it was broken in many of my tests. (More details in the reddit post).
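For anyone who wants to check their frontend against the suggested format, here is a minimal Python sketch that builds the string from a strictly alternating user/assistant history. The function name `build_prompt` and the role check are my own illustration, not something from Mistral or llama.cpp. Also note that many inference engines add the `<s>` token themselves, so whether to include it in the text is an assumption to verify for your setup.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Render strictly alternating user/assistant messages in the suggested format."""
    parts = [BOS]
    for i, message in enumerate(messages):
        # Even positions must be user turns, odd positions assistant turns.
        expected = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"message {i}: expected role {expected!r}, got {message['role']!r}"
            )
        if expected == "user":
            # Space after [INST], no space before [/INST].
            parts.append(f"[INST] {message['content']}[/INST]")
        else:
            # Space before the assistant text, EOS directly after it.
            parts.append(f" {message['content']}{EOS}")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "user message"},
    {"role": "assistant", "content": "assistant message"},
    {"role": "user", "content": "new user message"},
]
print(build_prompt(conversation))
# <s>[INST] user message[/INST] assistant message</s>[INST] new user message[/INST]
```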
I can confirm this with the older mistral nemo based models (still d/l'ing this one, presumably it will be the same).
God, I wish Mistral used a better prompt format
I just throw in what the actual tokenizer chat template compiles to, hence the <s> at the start, and I assume the Jinja will handle the rest properly, which it looks like it will?
I can't speak to whether the system prompt should get its own response; that feels like just multi-turn prompting and suggests that a system message just isn't supported.
Otherwise I see no difference between the chat template provided and the one in the AWS link.
You don't need a better prompt format, if you just use the model's original tokenizer.
Not sure how GGUF people handle this issue, but I was able to write a quick Python script using the transformers library to instantiate the tokenizer from here:
- https://huggingface.co/mistralai/Mistral-Small-Instruct-2409/blob/main/tokenizer.json
- https://huggingface.co/mistralai/Mistral-Small-Instruct-2409/blob/main/tokenizer.model
and if you want to use the v3 tokenizer, you can use the same JSON, but instead, with this model:
That will allow you to never care about the prompt format.
Also, with a good inference engine, you usually get both a completions endpoint (no tokenizer, so you need to define the prompt format yourself) and a chat/completions endpoint (which uses the tokenizer, so you do not need to specify the prompt format).
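To make the difference between the two endpoints concrete, here is a rough sketch using only the Python standard library. The base URL, the model name, and the helper names (`chat_request`, `completion_request`, `send`) are assumptions for illustration; adjust them to whatever OpenAI-compatible server you run. The point is only the shape of the request body: `messages` for chat/completions, a pre-formatted `prompt` for completions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: a local OpenAI-compatible server
MODEL = "Mistral-Small-Instruct-2409"  # assumption: whatever name your server expects


def _post(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_request(messages: list) -> urllib.request.Request:
    # chat/completions: send role-tagged messages, the server formats the prompt.
    return _post("/chat/completions", {"model": MODEL, "messages": messages})


def completion_request(prompt: str) -> urllib.request.Request:
    # completions: send raw text, so the [INST] markers are your responsibility.
    return _post("/completions", {"model": MODEL, "prompt": prompt})


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request; needs a running server, so it is not called here."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


chat = chat_request([{"role": "user", "content": "What is the capital of France?"}])
raw = completion_request("<s>[INST] What is the capital of France?[/INST]")
print(chat.full_url, json.loads(chat.data))
print(raw.full_url, json.loads(raw.data))
```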
I made a Jinja2 prompt template to support sequences that do not strictly alternate user/assistant/user/assistant..., by gluing consecutive messages from the same side together.
{{- '<s>' }}
{%- for message in messages %}
{%- set prev_message = messages[loop.index0 - 1] if not loop.first else None %}
{%- set next_message = messages[loop.index] if not loop.last else None %}
{%- if message['role'] != 'assistant' %}
{%- if not prev_message or prev_message['role'] == 'assistant' %}
{{- '[INST] ' }}
{%- endif %}
{{- message['content'] }}
{%- if not next_message or next_message['role'] == 'assistant' %}
{{- '[/INST]' }}
{%- elif message['role'] == 'system' %}
{{- '\n\n' }}
{%- else %}
{{- '\n' }}
{%- endif %}
{%- elif message['role'] == 'assistant' %}
{%- if loop.first %}
{{- '[INST] [/INST]' }}
{%- endif %}
{{- ' ' + message['content'] }}
{%- if next_message and next_message['role'] != 'assistant' %}
{{- '</s>' }}
{%- else %}
{{- '</s>[INST] [/INST]' }}
{%- endif %}
{%- endif %}
{%- endfor %}
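For readers who find the Jinja hard to step through, here is a plain-Python transcription of the same branching, as I read the template above. It is only a sketch for checking what the template should emit (the function name `render` is made up for this example); it is not a replacement for rendering the actual template with Jinja2.

```python
from typing import Dict, List


def render(messages: List[Dict[str, str]]) -> str:
    """Mirror the template's branches: merge consecutive non-assistant messages."""
    out = ["<s>"]
    last = len(messages) - 1
    for i, message in enumerate(messages):
        prev = messages[i - 1] if i > 0 else None
        nxt = messages[i + 1] if i < last else None
        if message["role"] != "assistant":
            # Open an instruction block only at the start of a non-assistant run.
            if prev is None or prev["role"] == "assistant":
                out.append("[INST] ")
            out.append(message["content"])
            if nxt is None or nxt["role"] == "assistant":
                out.append("[/INST]")  # the run ends here
            elif message["role"] == "system":
                out.append("\n\n")  # blank line after a system message
            else:
                out.append("\n")  # single newline between merged user messages
        else:
            if i == 0:
                out.append("[INST] [/INST]")  # empty instruction before a leading reply
            out.append(" " + message["content"])
            if nxt is not None and nxt["role"] != "assistant":
                out.append("</s>")
            else:
                # Next message is also from the assistant, or this is the last one.
                out.append("</s>[INST] [/INST]")
    return "".join(out)


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
    {"role": "assistant", "content": "Yes."},
    {"role": "user", "content": "Good."},
]
print(repr(render(history)))
# '<s>[INST] Be brief.\n\nHi\nAre you there?[/INST] Yes.</s>[INST] Good.[/INST]'
```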
@vevi33
Hi there! Actually, the v3 should look more like:
<s>[INST] user message[/INST] assistant message</s>[INST] new user message[/INST]
For a more in-depth explanation: https://github.com/mistralai/cookbook/blob/main/concept-deep-dive/tokenization/chat_templates.md
@pandora-s
Thank you for the clarification!
I proposed basically this, if I am not wrong, but I corrected my post according to your link so that it is exactly the same and does not confuse anyone!
Thanks to everyone for being helpful and making this topic finally clear in the community!
<s>[INST] user message[/INST] assistant message</s>[INST] new user message[/INST]
For llama.cpp, the prompt template will look like this:
--in-prefix "</s>[INST] " --in-suffix "[/INST] " -p "<s>[INST] You are a helpful assistant.[/INST]"
@pandora-s Just to clarify: what you've written here is the format one should use for Mistral-Small-Instruct-2409, right?
I used https://huggingface.co/MarinaraSpaghetti/SillyTavern-Settings
Awesome! Thanks! It really does contribute a lot... in everything, logic, prose, immersion... incredible.
I'm using Marinara's presets too and they make a world of difference far as rp is concerned with Mistral models.
@ddh0 yes, the original Small repo was fixed a few hours ago with the correct template, sorry for the trouble!