Add files using upload-large-folder tool
- .swival/repl_history +9 -0
- README.md +41 -13
.swival/repl_history
ADDED

    @@ -0,0 +1,9 @@
    +
    +# 2026-05-25 12:10:06.766878
    ++hello
    +
    +# 2026-05-25 12:11:32.119684
    ++List my files
    +
    +# 2026-05-25 12:20:09.313414
    ++/new
README.md
CHANGED

    @@ -13,16 +13,21 @@ tags:
     - agent
     - mixture-of-experts
     - long-context
    +- mtp
     pipeline_tag: text-generation
     ---
     
    -# MiMo-V2.5 Coder Q2 GGUF
    +# MiMo-V2.5 Coder Q2 MTP GGUF
     
    -
    +*Work in progress, please use the non-MTP version for now*
     
    -This
    +This is the MTP-included sibling of `MiMo-V2.5-coder-Q2`: a local, self-quantized GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling.
     
    -
    +This quant was optimized for systems with 128 GB of memory and a 100,000 tokens context size. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.
    +
    +It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders.
    +
    +This variant includes MiMo's three multi-token prediction blocks and is meant for llama.cpp builds with `draft-mtp` speculative decoding support. If you run it as a plain non-speculative model, llama.cpp may report the trailing MTP tensors as unused; that is expected when speculative MTP is disabled.
     
     ## Quantization
     
    @@ -30,9 +35,10 @@ High-level summary:
     
     - Quant type: `Q2_K_S`
     - Importance matrix: coding and tool-calling focused
    -- Preserved higher precision for embeddings, output, attention,
    +- Preserved higher precision for embeddings, output, attention, dense first FFN, and MTP dense/projection tensors
     - MoE down-expert tensors: `Q3_K`
    -- Reported quantized size: about
    +- Reported quantized size: about 109,026.87 MiB at 2.95 BPW
    +- MTP metadata: `mimo2.block_count = 51`, `mimo2.nextn_predict_layers = 3`
     
     One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab `</s>` token at load time. MiMo's actual EOS token remains `<|im_end|>`.
     
    @@ -45,13 +51,17 @@ This build deliberately prioritizes:
     - English prompts and codebase work
     - practical inference on a 128 GB Apple Silicon system
     
    +The importance matrix was built from an expanded English calibration set with coding, review, shell, and Swival-style tool-use prompts. It used the first runnable local Q2 GGUF as the calibration model and focused on the main text generation path. The MTP tensors were not present in that first-pass calibration matrix, so they were protected manually at `Q4_K` where it mattered most.
    +
     Chinese-language quality and multimodal use were not optimization targets.
     
     ## Serving
     
    +Most users should start it directly from Hugging Face with llama.cpp:
    +
     ```sh
     llama-server \
    -  -hf jedisct1/MiMo-V2.5-coder-Q2 \
    +  -hf jedisct1/MiMo-V2.5-coder-Q2-MTP \
       --host 127.0.0.1 \
       --port 8080 \
       --ctx-size 100000 \
    @@ -70,7 +80,9 @@ llama-server \
       --gpu-layers auto \
       --cache-type-k f16 \
       --cache-type-v f16 \
    -  --reasoning off
    +  --reasoning off \
    +  --spec-type draft-mtp \
    +  --spec-draft-n-max 3
     ```
     
     This starts an OpenAI-compatible server on `127.0.0.1:8080`. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.
    @@ -93,9 +105,11 @@ MIMO_BATCH=512
     MIMO_UBATCH=128
     MIMO_REASONING=off
     MIMO_CPU_MOE=0
    +MIMO_SPEC_TYPE=draft-mtp
    +MIMO_SPEC_DRAFT_N_MAX=3
     ```
     
    -These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model's Jinja chat template, use Flash Attention, and ask llama.cpp to fit as much of the model as possible onto Metal.
    +These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model's Jinja chat template, use Flash Attention, enable llama.cpp's MTP speculative decoding, and ask llama.cpp to fit as much of the model as possible onto Metal.
     
     If you hit memory pressure, use the safer CPU-MoE mode:
     
    @@ -115,7 +129,7 @@ You can also run `llama-server` directly against local files without the helper
     
     ```sh
     llama-server \
    -  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
    +  --model MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf \
       --host 127.0.0.1 \
       --port 8080 \
       --ctx-size 100000 \
    @@ -134,14 +148,16 @@ llama-server \
       --gpu-layers auto \
       --cache-type-k f16 \
       --cache-type-v f16 \
    -  --reasoning off
    +  --reasoning off \
    +  --spec-type draft-mtp \
    +  --spec-draft-n-max 3
     ```
     
     For the safer CPU-MoE fallback, add `--cpu-moe` and use a larger fit margin:
     
     ```sh
     llama-server \
    -  --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
    +  --model MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf \
       --ctx-size 100000 \
       --fit on \
       --fit-target 32768 \
    @@ -154,17 +170,29 @@ llama-server \
       --cache-type-k f16 \
       --cache-type-v f16 \
       --reasoning off \
    +  --spec-type draft-mtp \
    +  --spec-draft-n-max 3 \
       --cpu-moe
     ```
     
    +## MTP Runtime Note
    +
    +This GGUF keeps the MTP tensors and the serving examples enable llama.cpp's `draft-mtp` speculative decoder. Plain generation without `--spec-type draft-mtp` can show warnings like `model has unused tensor blk.48...` because the MTP blocks are not part of the normal trunk pass. That warning is expected for non-speculative loads and is not a corrupted-file warning.
    +
    +To disable speculative decoding for troubleshooting:
    +
    +```sh
    +MIMO_SPEC_TYPE=none ./run-server.sh
    +```
    +
     ## Tool-Calling Notes
     
     For best tool-calling results:
     
    -- Use the [Swival](https://swival.dev) harness - it should work with anything using OpenAI-like tool calling convention, but it is tested with Swival.
     - Disable model reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
     - Set `parallel_tool_calls` to `false` if your client supports it.
     - Avoid forcing `tool_choice: required`; in testing it made malformed calls more likely.
    +- Use request-provided OpenAI tool schemas rather than llama.cpp built-in server tools unless you are intentionally testing those built-ins.
     
     ## License
     
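
As an illustration of the tool-calling notes in the README, the following Python sketch builds a chat-completions request that sets `parallel_tool_calls` to `false`, leaves `tool_choice` at `auto` instead of `required`, and supplies its own tool schema in the request. The `list_files` tool, the helper names `build_tool_request` and `send`, and the use of the repository name as the `model` value are assumptions made for this example. Only the address `127.0.0.1:8080` comes from the serving commands.

```python
import json
import urllib.request

# Assumption: llama-server listens where the README's serving examples put it.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_tool_request(prompt: str) -> dict:
    """Build a chat-completions payload that follows the tool-calling notes."""
    # Hypothetical tool, defined only for this illustration.
    list_files_tool = {
        "type": "function",
        "function": {
            "name": "list_files",
            "description": "List files in a directory of the project.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory to list."}
                },
                "required": ["path"],
            },
        },
    }
    return {
        # Assumption: the repository name is used as the model identifier.
        "model": "jedisct1/MiMo-V2.5-coder-Q2-MTP",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [list_files_tool],    # schema supplied by the request itself
        "tool_choice": "auto",         # not "required"
        "parallel_tool_calls": False,  # ask for one tool call at a time
    }


def send(payload: dict, base_url: str = BASE_URL) -> dict:
    """POST the payload to the chat-completions endpoint and decode the reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_tool_request("List my files")
print(json.dumps(payload, indent=2))
```

Calling `send(payload)` posts the request to a running server. It is left uncalled here so the sketch runs without one.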
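
The reported size and bits-per-weight figures in the Quantization section can be cross-checked with a little arithmetic. The sketch below derives the implied weight count and the memory left over on a 128 GB machine. Treating 128 GB as 128 GiB and ignoring the KV cache, the operating system, and other processes are simplifying assumptions, so the headroom figure is only a rough upper bound.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

size_mib = 109_026.87   # reported quantized size from the README
bits_per_weight = 2.95  # reported BPW from the README
memory_gib = 128        # target system; assumed here to mean 128 GiB

# Total bits in the quantized files, then weights implied by the average BPW.
size_bits = size_mib * MIB * 8
implied_weights = size_bits / bits_per_weight

# Model footprint in GiB and what remains for everything else.
size_gib = size_mib * MIB / GIB
headroom_gib = memory_gib - size_gib

print(f"implied weight count: {implied_weights / 1e9:.0f} billion")
print(f"model size: {size_gib:.1f} GiB")
print(f"headroom before KV cache and OS: {headroom_gib:.1f} GiB")
```

Under these assumptions the files occupy about 106.5 GiB and imply roughly 310 billion weights, leaving a little over 21 GiB for the 100,000-token KV cache and everything else. That margin is consistent with the README's advice that smaller-memory systems will need more CPU offload or a smaller context.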