jedisct1 committed 1 day ago

Commit c3ebe02 (verified) · 1 Parent(s): eea7d9c

Add files using upload-large-folder tool

Files changed (2)

.swival/repl_history +9 -0
README.md +41 -13

.swival/repl_history ADDED

	@@ -0,0 +1,9 @@

+# 2026-05-25 12:10:06.766878
++hello
+# 2026-05-25 12:11:32.119684
++List my files
+# 2026-05-25 12:20:09.313414
++/new

README.md CHANGED

@@ -13,16 +13,21 @@ tags:
 - agent
 - mixture-of-experts
 - long-context
 pipeline_tag: text-generation
 ---
-# MiMo-V2.5 Coder Q2 GGUF
-This is a local, self-quantized GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and tool-calling on a 128 GB Apple Silicon M5 machine.
-This quant was optimized for systems with 128 GB of memory. The default serving profile targets a 128 GB Apple Silicon machine and tries to keep the model practical at a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.
-It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.
 ## Quantization
@@ -30,9 +35,10 @@ High-level summary:
 - Quant type: `Q2_K_S`
 - Importance matrix: coding and tool-calling focused
-- Preserved higher precision for embeddings, output, attention, and the dense first FFN
 - MoE down-expert tensors: `Q3_K`
-- Reported quantized size: about 108,496.76 MiB at 2.95 BPW
 One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab `</s>` token at load time. MiMo's actual EOS token remains `<|im_end|>`.
@@ -45,13 +51,17 @@ This build deliberately prioritizes:
 - English prompts and codebase work
 - practical inference on a 128 GB Apple Silicon system
 Chinese-language quality and multimodal use were not optimization targets.
 ## Serving
 ```sh
 llama-server \
-  -hf jedisct1/MiMo-V2.5-coder-Q2 \
   --host 127.0.0.1 \
   --port 8080 \
   --ctx-size 100000 \
@@ -70,7 +80,9 @@ llama-server \
   --gpu-layers auto \
   --cache-type-k f16 \
   --cache-type-v f16 \
-  --reasoning off
 ```
 This starts an OpenAI-compatible server on `127.0.0.1:8080`. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.
@@ -93,9 +105,11 @@ MIMO_BATCH=512
 MIMO_UBATCH=128
 MIMO_REASONING=off
 MIMO_CPU_MOE=0
 ```
-These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model's Jinja chat template, use Flash Attention, and ask llama.cpp to fit as much of the model as possible onto Metal.
 If you hit memory pressure, use the safer CPU-MoE mode:
@@ -115,7 +129,7 @@ You can also run `llama-server` directly against local files without the helper
 ```sh
 llama-server \
-  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
   --host 127.0.0.1 \
   --port 8080 \
   --ctx-size 100000 \
@@ -134,14 +148,16 @@ llama-server \
   --gpu-layers auto \
   --cache-type-k f16 \
   --cache-type-v f16 \
-  --reasoning off
 ```
 For the safer CPU-MoE fallback, add `--cpu-moe` and use a larger fit margin:
 ```sh
 llama-server \
-  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
   --ctx-size 100000 \
   --fit on \
   --fit-target 32768 \
@@ -154,17 +170,29 @@ llama-server \
   --cache-type-k f16 \
   --cache-type-v f16 \
   --reasoning off \
   --cpu-moe
 ```
 ## Tool-Calling Notes
 For best tool-calling results:
-- Use the [Swival](https://swival.dev) harness - it should work with anything using OpenAI-like tool calling convention, but it is tested with Swival.
 - Disable model reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
 - Set `parallel_tool_calls` to `false` if your client supports it.
 - Avoid forcing `tool_choice: required`; in testing it made malformed calls more likely.
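The notes above can be sketched as a client request. This is a minimal illustration, not the author's test harness: it assumes the server from the serving examples is listening on `127.0.0.1:8080`, and the `list_files` tool schema and the helper names `build_tool_request` and `send` are hypothetical. It sets `parallel_tool_calls` to `false` and leaves `tool_choice` unset.

```python
import json
import urllib.request

# Matches --host/--port in the serving examples.
BASE_URL = "http://127.0.0.1:8080/v1"

# Hypothetical tool schema, for illustration only.
LIST_FILES_TOOL = {
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List files in a directory.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt, tools, model="jedisct1/MiMo-V2.5-coder-Q2"):
    """Build a chat-completions payload that follows the notes above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        # One tool call at a time.
        "parallel_tool_calls": False,
        # "tool_choice" is deliberately not set to "required".
    }


def send(payload, base_url=BASE_URL):
    """POST the payload to the OpenAI-compatible endpoint and decode the reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    payload = build_tool_request("List my files", [LIST_FILES_TOOL])
    print(json.dumps(payload, indent=2))
    # With llama-server running, uncomment to send the request:
    # print(send(payload))
```

Reasoning output is disabled on the server side (`--reasoning off`), so nothing in the payload needs to control it.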
 ## License

 - agent
 - mixture-of-experts
 - long-context
+- mtp
 pipeline_tag: text-generation
 ---
+# MiMo-V2.5 Coder Q2 MTP GGUF
+*Work in progress, please use the non-MTP version for now*
+This is the MTP-included sibling of `MiMo-V2.5-coder-Q2`: a local, self-quantized GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling.
+This quant was optimized for systems with 128 GB of memory and a 100,000 tokens context size. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.
+It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders.
+This variant includes MiMo's three multi-token prediction blocks and is meant for llama.cpp builds with `draft-mtp` speculative decoding support. If you run it as a plain non-speculative model, llama.cpp may report the trailing MTP tensors as unused; that is expected when speculative MTP is disabled.
 ## Quantization
 - Quant type: `Q2_K_S`
 - Importance matrix: coding and tool-calling focused
+- Preserved higher precision for embeddings, output, attention, dense first FFN, and MTP dense/projection tensors
 - MoE down-expert tensors: `Q3_K`
+- Reported quantized size: about 109,026.87 MiB at 2.95 BPW
+- MTP metadata: `mimo2.block_count = 51`, `mimo2.nextn_predict_layers = 3`
 One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab `</s>` token at load time. MiMo's actual EOS token remains `<|im_end|>`.
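As a rough sanity check on the reported figures, the two README versions in this diff quote about 108,496.76 MiB before and 109,026.87 MiB after, both at 2.95 BPW. The sketch below computes the size added by the MTP tensors and a back-of-the-envelope weight count. It assumes 1 MiB = 2^20 bytes and that the reported size covers only quantized tensor data; the BPW value is rounded to two decimals, so the weight count is only an estimate.

```python
MIB = 1024 * 1024  # bytes per MiB (assumed unit of the reported sizes)

NON_MTP_MIB = 108_496.76  # reported size before this commit
MTP_MIB = 109_026.87      # reported size after this commit
BPW = 2.95                # reported bits per weight (rounded)


def approx_weights(size_mib: float, bpw: float) -> float:
    """Estimate the number of weights from file size and bits per weight."""
    return size_mib * MIB * 8 / bpw


mtp_overhead_mib = MTP_MIB - NON_MTP_MIB

print(f"Size added by MTP tensors: {mtp_overhead_mib:.2f} MiB")
print(f"Approximate weights (MTP build): {approx_weights(MTP_MIB, BPW) / 1e9:.0f} billion")
```

The difference between the two builds is about 530 MiB, which is the on-disk cost of keeping the three MTP blocks.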
 - English prompts and codebase work
 - practical inference on a 128 GB Apple Silicon system
+The importance matrix was built from an expanded English calibration set with coding, review, shell, and Swival-style tool-use prompts. It used the first runnable local Q2 GGUF as the calibration model and focused on the main text generation path. The MTP tensors were not present in that first-pass calibration matrix, so they were protected manually at `Q4_K` where it mattered most.
 Chinese-language quality and multimodal use were not optimization targets.
 ## Serving
+Most users should start it directly from Hugging Face with llama.cpp:
 ```sh
 llama-server \
+  -hf jedisct1/MiMo-V2.5-coder-Q2-MTP \
   --host 127.0.0.1 \
   --port 8080 \
   --ctx-size 100000 \
   --gpu-layers auto \
   --cache-type-k f16 \
   --cache-type-v f16 \
+  --reasoning off \
+  --spec-type draft-mtp \
+  --spec-draft-n-max 3
 ```
 This starts an OpenAI-compatible server on `127.0.0.1:8080`. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.
 MIMO_UBATCH=128
 MIMO_REASONING=off
 MIMO_CPU_MOE=0
+MIMO_SPEC_TYPE=draft-mtp
+MIMO_SPEC_DRAFT_N_MAX=3
 ```
+These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model's Jinja chat template, use Flash Attention, enable llama.cpp's MTP speculative decoding, and ask llama.cpp to fit as much of the model as possible onto Metal.
 If you hit memory pressure, use the safer CPU-MoE mode:
 ```sh
 llama-server \
+  --model MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf \
   --host 127.0.0.1 \
   --port 8080 \
   --ctx-size 100000 \
   --gpu-layers auto \
   --cache-type-k f16 \
   --cache-type-v f16 \
+  --reasoning off \
+  --spec-type draft-mtp \
+  --spec-draft-n-max 3
 ```
 For the safer CPU-MoE fallback, add `--cpu-moe` and use a larger fit margin:
 ```sh
 llama-server \
+  --model MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf \
   --ctx-size 100000 \
   --fit on \
   --fit-target 32768 \
   --cache-type-k f16 \
   --cache-type-v f16 \
   --reasoning off \
+  --spec-type draft-mtp \
+  --spec-draft-n-max 3 \
   --cpu-moe
 ```
+## MTP Runtime Note
+This GGUF keeps the MTP tensors and the serving examples enable llama.cpp's `draft-mtp` speculative decoder. Plain generation without `--spec-type draft-mtp` can show warnings like `model has unused tensor blk.48...` because the MTP blocks are not part of the normal trunk pass. That warning is expected for non-speculative loads and is not a corrupted-file warning.
+To disable speculative decoding for troubleshooting:
+```sh
+MIMO_SPEC_TYPE=none ./run-server.sh
+```
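The `run-server.sh` helper is not part of this diff, so how it turns the `MIMO_*` variables into flags is not shown. The following is a hypothetical sketch of one plausible mapping, using only flags that appear in the serving examples. The function name `build_server_argv` and the rule that `MIMO_SPEC_TYPE=none` simply omits the speculative flags are assumptions, not the helper's confirmed behavior.

```python
import os
import shlex


def build_server_argv(env, model_path):
    """Translate MIMO_* settings into llama-server arguments (assumed mapping)."""
    argv = [
        "llama-server",
        "--model", model_path,
        "--reasoning", env.get("MIMO_REASONING", "off"),
    ]

    # Assumption: "none" means the speculative flags are left out entirely.
    spec_type = env.get("MIMO_SPEC_TYPE", "draft-mtp")
    if spec_type != "none":
        argv += [
            "--spec-type", spec_type,
            "--spec-draft-n-max", env.get("MIMO_SPEC_DRAFT_N_MAX", "3"),
        ]

    # Assumption: MIMO_CPU_MOE=1 selects the CPU-MoE fallback.
    if env.get("MIMO_CPU_MOE", "0") == "1":
        argv.append("--cpu-moe")

    return argv


if __name__ == "__main__":
    argv = build_server_argv(
        os.environ, "MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf"
    )
    # Print the command instead of running it.
    print(shlex.join(argv))
```

Printing the command rather than executing it makes it easy to compare against the hand-written `llama-server` invocations in the README.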
 ## Tool-Calling Notes
 For best tool-calling results:
 - Disable model reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
 - Set `parallel_tool_calls` to `false` if your client supports it.
 - Avoid forcing `tool_choice: required`; in testing it made malformed calls more likely.
+- Use request-provided OpenAI tool schemas rather than llama.cpp built-in server tools unless you are intentionally testing those built-ins.
 ## License