Can you please provide the command to change the context size?
Can you please provide the command to change the context size to avoid redownloading the whole model for those with limited/slow bandwidth?
I'm confused. Is there something special about how this model handles context size? Usually you set that at runtime in your inference engine. You can't do much about whatever was used to train the model.
For example, in llama.cpp you would do:
Context Size
-c N, --ctx-size N: Set the size of the prompt context (default: 0, 0 = loaded from model). The LLaMA models were built with a context of 2048-8192, which will yield the best results on longer input/inference.
There is the manual setting and there is the default value.
What's wrong here is the default value. In a UI like LM Studio it is read automatically from the metadata inside the GGUF.
This is why you'll notice the model was reuploaded yesterday (the Q4_K_M, for example).
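To make the manual-versus-default distinction concrete, here is a minimal sketch that builds a llama-server command line with and without the `-c` override. The helper name `llama_server_command` and the model path are made up for illustration; only the `-m` and `-c` flags come from llama.cpp itself.

```python
import shlex


def llama_server_command(model_path, ctx_size=None):
    """Build a llama-server argument list (hypothetical helper).

    With ctx_size=None no -c flag is passed, so llama.cpp falls back to
    its default (0 = use the context length stored in the GGUF metadata).
    """
    cmd = ["llama-server", "-m", model_path]
    if ctx_size is not None:
        # -c / --ctx-size is the manual setting; it overrides the metadata default
        cmd += ["-c", str(ctx_size)]
    return cmd


if __name__ == "__main__":
    # Example path only; point this at your own file
    model = "/models/Mistral-Large-Instruct-2407-Q4_K_M.gguf"
    print(shlex.join(llama_server_command(model)))         # default from metadata
    print(shlex.join(llama_server_command(model, 32768)))  # manual override
```

With the manual override, a wrong default in the file does not matter, so no redownload is needed for command-line use.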
Ahh I understand. Is there a way to edit the metadata of an existing GGUF to fix the bug?
Good question... Here is what I've found:
- GGUF File Format - including
[llm].context_length: uint64: Also known as n_ctx. length of the context (in tokens) that the model was trained on. For most architectures, this is the hard limit on the length of the input. Architectures, like RWKV, that are not reliant on transformer-style attention may be able to handle larger inputs, but this is not guaranteed.
- GGUF CLI Tool - doesn't seem to have an "update" method.
- I asked the model itself how to do it. I used this "system" and "user" prompt before copy/pasting the file format above for context.
You are an experienced software developer with many years experience writing programs and scripts in bash, python, and Linux. Assist the user generating high quality professional well commented code.
How can I edit a GGUF formatted file and change the n_ctx to increase the default context length? Is there an existing program or must I use a hexeditor?
<copy/paste GGUF spec>
<repeat user prompt>
Editing a GGUF (GGML Universal Format) file to change the context length (`n_ctx`) or any other metadata typically requires a tool that understands the format. Using a hex editor is not recommended because it can easily corrupt the file if not done correctly.
As of my last update in October 2023, there isn't a widely known or standardized tool specifically for editing GGUF files. However, you can use the following general approach to modify the metadata:
1. **Understand the Format**: Familiarize yourself with the GGUF format specification. This will help you locate the metadata section where `n_ctx` is stored.
2. **Use a Script or Tool**: If you have access to the original conversion script or tool that created the GGUF file, you can modify the script to change the `n_ctx` value and then re-run the conversion.
3. **Custom Script**: Write a custom script in a language like Python to read the GGUF file, modify the `n_ctx` value, and then write the modified file back. This requires understanding the binary format and how the metadata is stored.
Here's a high-level outline of what a custom script might look like in Python:
<some generic struct packing JSON decoding python code>
I'm curious if you find an answer better than backing up the file and fussing with a hex editor lol...
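For checking what a file currently stores before touching it, here is a rough sketch of a metadata reader based on the GGUF spec quoted above. It assumes the GGUF version 2 or 3 layout with little-endian values, and the function names are my own, not part of any library. It only reads the file.

```python
import struct

# GGUF metadata value type ids mapped to struct formats for fixed-size scalars
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
STRING, ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype in SCALARS:
        return _read(f, SCALARS[vtype])
    if vtype == STRING:
        return _read_string(f)
    if vtype == ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_metadata(path):
    """Return the metadata key/value pairs of a GGUF file as a dict."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version = _read(f, "<I")
        _tensor_count = _read(f, "<Q")
        kv_count = _read(f, "<Q")
        meta = {}
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _read(f, "<I")
            meta[key] = _read_value(f, vtype)
    return meta


def context_length(path):
    """Look up [llm].context_length using the architecture named in the file."""
    meta = read_metadata(path)
    arch = meta["general.architecture"]
    return meta[f"{arch}.context_length"]

# Usage (path is an example): print(context_length("/models/some-model.gguf"))
```

For a split model the metadata should be in the first part, so that is the file to point it at.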
Ah sorry I missed this!
You'll need llama.cpp locally.
Run:
python3 llama.cpp/gguf-py/scripts/gguf_set_metadata.py /models/Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00002.gguf llama.context_length 131072 --force
If you have a model that's split, target the first part as I did above. If it's not split, just target the full model file.
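If you're curious why this avoids a redownload: an integer metadata value has a fixed size, so it can be overwritten in place without moving anything else in the file. The sketch below shows that idea with plain `struct` calls. It is not the code of `gguf_set_metadata.py`, the function names are hypothetical, and it assumes the GGUF version 2 or 3 layout with little-endian values. Back up the file before trying anything like this.

```python
import struct

# GGUF metadata value type ids mapped to struct formats for fixed-size scalars
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
INTEGER_TYPES = {0, 1, 2, 3, 4, 5, 10, 11}
STRING, ARRAY = 8, 9


def _read(f, fmt):
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _skip_value(f, vtype):
    """Advance the file position past one metadata value."""
    if vtype in SCALARS:
        f.seek(struct.calcsize(SCALARS[vtype]), 1)
    elif vtype == STRING:
        f.seek(_read(f, "<Q"), 1)
    elif vtype == ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        for _ in range(count):
            _skip_value(f, item_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def set_integer_metadata(path, key, new_value):
    """Overwrite an integer metadata value in place and return the old value."""
    with open(path, "r+b") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        f.seek(4 + 8, 1)  # skip version (uint32) and tensor count (uint64)
        kv_count = _read(f, "<Q")
        for _ in range(kv_count):
            name = f.read(_read(f, "<Q")).decode("utf-8")
            vtype = _read(f, "<I")
            if name != key:
                _skip_value(f, vtype)
                continue
            if vtype not in INTEGER_TYPES:
                raise TypeError(f"{key} is not an integer field")
            fmt = SCALARS[vtype]
            offset = f.tell()
            old_value = _read(f, fmt)
            f.seek(offset)  # go back and overwrite the same bytes
            f.write(struct.pack(fmt, new_value))
            return old_value
    raise KeyError(key)

# Usage (example path):
# set_integer_metadata("/models/model-00001-of-00002.gguf",
#                      "llama.context_length", 131072)
```

The file size stays the same and the tensor data is never touched, which is why only a few bytes change.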