Can you please provide the command to change the context size?

by yehiaserag - opened Jul 28, 2024

Discussion

yehiaserag

Jul 28, 2024

Can you please provide the command to change the context size to avoid redownloading the whole model for those with limited/slow bandwidth?

ubergarm

Jul 28, 2024

@yehiaserag

I'm confused. Is there something special about how this model handles context size? Usually you set that at runtime in your inference engine. You can't do much about whatever was used when the model was trained.

For example, in llama.cpp you would do:

Context Size

    -c N, --ctx-size N: Set the size of the prompt context (default: 0, 0 = loaded from model). The LLaMA models were built with a context of 2048-8192, which will yield the best results on longer input/inference.

yehiaserag

Jul 28, 2024 (edited Jul 28, 2024)

There is the manual setting and there is the default value.
What's wrong here is the default value: a UI like LM Studio reads it automatically from the metadata inside the GGUF.
That is why you'll notice the model was reuploaded yesterday (the Q4_K_M, for example).

ubergarm

Jul 28, 2024

Ahh I understand. Is there a way to edit the metadata of an existing GGUF to fix the bug? Good question... Here is what I've found:

GGUF File Format - including

[llm].context_length: uint64: Also known as n_ctx. length of the context (in tokens) that the model was trained on. For most architectures, this is the hard limit on the length of the input. Architectures, like RWKV, that are not reliant on transformer-style attention may be able to handle larger inputs, but this is not guaranteed.

GGUF CLI Tool - doesn't seem to have an "update" method.
I asked the model itself how to do it. I used this "system" and "user" prompt before copy/pasting the file format above for context.

You are an experienced software developer with many years of experience writing programs and scripts in bash, python, and Linux. Assist the user in generating high-quality, professional, well-commented code.

How can I edit a GGUF formatted file and change the n_ctx to increase the default context length? Is there an existing program or must I use a hex editor?
<copy/paste GGUF spec>
<repeat user prompt>

Editing a GGUF (GGML Universal Format) file to change the context length (`n_ctx`) or any other metadata typically requires a tool that understands the format. Using a hex editor is not recommended because it can easily corrupt the file if not done correctly.

As of my last update in October 2023, there isn't a widely known or standardized tool specifically for editing GGUF files. However, you can use the following general approach to modify the metadata:

1. **Understand the Format**: Familiarize yourself with the GGUF format specification. This will help you locate the metadata section where `n_ctx` is stored.

2. **Use a Script or Tool**: If you have access to the original conversion script or tool that created the GGUF file, you can modify the script to change the `n_ctx` value and then re-run the conversion.

3. **Custom Script**: Write a custom script in a language like Python to read the GGUF file, modify the `n_ctx` value, and then write the modified file back. This requires understanding the binary format and how the metadata is stored.

Here's a high-level outline of what a custom script might look like in Python:
<some generic struct packing JSON decoding python code>

I'm curious if you find an answer better than backing up the file and fussing with a hex editor lol...

bartowski

Owner Jul 29, 2024 (edited Jul 29, 2024)

Ah sorry I missed this!

You'll need a local copy of llama.cpp.

Run:

python3 llama.cpp/gguf-py/scripts/gguf_set_metadata.py /models/Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00002.gguf llama.context_length 131072 --force

If you have a model that's split, target the first part as I did above. If it's not split, just target the full model file.

yehiaserag

Jul 30, 2024

Thanks a lot @bartowski !
