---
license: mit
base_model: XiaomiMiMo/MiMo-V2.5
language:
- en
library_name: llama.cpp
tags:
- gguf
- llama.cpp
- text-generation
- code
- coding
- tool-calling
- agent
- mixture-of-experts
- long-context
pipeline_tag: text-generation
---
# MiMo-V2.5 Coder Q2 v2 GGUF
This is a text-only GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling on high-memory local machines.
The target system for this build is a 128 GB Apple Silicon machine. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Smaller-memory machines will likely need a smaller context, more CPU offload, or a smaller quant.
This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. MiMo's multi-token prediction blocks were also omitted because the current llama.cpp MiMo2 generation path does not use those blocks for normal inference.
## Why this build exists
Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. This build was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:
- coding in common systems and scripting languages
- web UI/component generation
- OpenAI-compatible tool calling
- agent loops over real files and commands
- long English technical prompts
Chinese-language quality and multimodal behavior were not optimization targets.
## How it was built
The source was the original `XiaomiMiMo/MiMo-V2.5` checkpoint, converted to GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and omitted runtime-inactive MTP/NextN blocks, so memory is not spent on tensors that current llama.cpp MiMo2 inference does not execute.
The final artifact is a split `Q2_K_S` GGUF with an importance matrix built from English coding, debugging, tool-calling, shell, and agent prompts. The calibration mix was designed to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.
The build was iterative:
1. Convert the original checkpoint to split BF16 GGUF.
2. Produce a first low-bit coding/tool-use candidate.
3. Test that candidate on executable coding tasks and realistic tool-calling agent loops.
4. Add calibration coverage for the failures that showed up in real tests.
5. Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
6. Re-quantize with the final `Q2_K_S` recipe.
The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.
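To make the role of the importance matrix more concrete, here is a toy sketch of importance-weighted quantization. It is not llama.cpp's actual quantizer and does not reproduce the `Q2_K_S` format: the function names, the three-level grid, and the example numbers are all illustrative assumptions. It only shows the general idea that a scale chosen under importance weights reduces error on the entries the calibration data marked as important.

```python
# Toy illustration only: NOT llama.cpp's quantizer or the Q2_K_S layout.
# All names, the grid size, and the sample values are hypothetical.

def quantize(values, scale, qmax=1):
    """Round each value to the nearest multiple of `scale`, clamped to +/- qmax steps."""
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


def weighted_error(values, approx, weights):
    """Sum of squared errors, each term multiplied by its importance weight."""
    return sum(w * (v - a) ** 2 for v, a, w in zip(values, approx, weights))


def best_scale(values, weights, qmax=1, candidates=64):
    """Grid-search the scale that minimizes the importance-weighted error."""
    top = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero blocks
    scales = [top * (i + 1) / candidates for i in range(candidates)]
    return min(
        scales,
        key=lambda s: weighted_error(values, quantize(values, s, qmax), weights),
    )


block = [0.9, -0.1, 0.35, -0.6]       # hypothetical weights in one block
uniform = [1.0, 1.0, 1.0, 1.0]        # no calibration data: every entry counts equally
importance = [0.1, 0.1, 5.0, 0.1]     # calibration says the third entry matters most

scale_plain = best_scale(block, uniform)
scale_imatrix = best_scale(block, importance)
```

Measured with the importance weights, the error at `scale_imatrix` is never worse than the error at `scale_plain`, because both are picked from the same candidate grid. That is the sense in which a calibration mix steers the limited bit budget toward the behavior it covers.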
Quantization details:
- Quant type: `Q2_K_S`
- Importance matrix: coding and tool-calling focused
- Embeddings and output tensors kept at higher precision
- Attention and dense first-FFN tensors protected at higher precision
- MoE down-expert tensors kept at `Q3_K`
- Reported size: 108,496.76 MiB (about 106 GiB) at 2.95 bits per weight (BPW)
- Split files: 16 GGUF shards
One tokenizer metadata fix is included: the base-vocabulary `</s>` token is marked as a control token so that llama.cpp does not warn about a control-looking token at load time. MiMo's real EOS token remains `<|im_end|>`.
## Why this recipe was chosen
The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.
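As a rough illustration of that memory envelope, the reported figures for this build (108,496.76 MiB at 2.95 BPW) can be converted into GiB and a nominal headroom figure. This is a minimal sketch: `quant_summary` is a hypothetical helper, and treating the machine's "128 GB" as 128 GiB of unified memory is an assumption. The real usable budget is smaller, because the operating system and the Metal working-set limit also take their share.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def quant_summary(size_mib: float, bpw: float, ram_gib: float = 128.0) -> dict:
    """Convert a reported GGUF size into GiB, implied weight count, and headroom.

    `ram_gib` is an assumption about the target machine, not a measured value.
    """
    size_bytes = size_mib * MIB
    size_gib = size_bytes / GIB
    return {
        "size_gib": size_gib,
        # Weight count implied by the reported bits per weight.
        "implied_weights_billion": size_bytes * 8 / bpw / 1e9,
        # Nominal space left for KV cache, buffers, and the OS.
        "headroom_gib": ram_gib - size_gib,
    }


summary = quant_summary(size_mib=108_496.76, bpw=2.95)
```

With these inputs the weights come to just under 106 GiB, leaving roughly 22 GiB of nominal headroom before the KV cache and runtime buffers are counted.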
The first plain `Q2_K`-family candidate was small enough, but it was not reliable enough for tool calling: it produced malformed arguments in some tool calls and failed to call several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:
- embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
- attention tensors are protected because tool-call and code prompts are structure-heavy
- the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
- MoE down-expert tensors use `Q3_K`, which was a better quality/memory tradeoff than pushing all expert down-projections lower
That is why this is still a Q2-class build, but not the smallest possible Q2 build.
## Why it is good at coding
This quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.
The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle Markdown parsing in JavaScript, incorrect use of Zig's checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.
The final v2 artifact passed the local coding and web-design harness across:
- Swift
- JavaScript
- TypeScript through Deno
- Rust
- C
- C++
- Zig
- Python
- Perl
- Go
- static HTML/CSS
That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.
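The actual harness is not published in this card. The following is only a minimal sketch of the pattern described: write a generated file into an isolated directory, run a local toolchain command against it, and check the result. `validate_generated_file` and its parameters are hypothetical names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def validate_generated_file(filename, source, command, expected_stdout=None, timeout=60):
    """Write `source` to `filename` in a fresh temp directory and run `command` there.

    Returns True only if the command exits with status 0 and, when
    `expected_stdout` is given, its trimmed stdout matches exactly.
    """
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / filename).write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                command,
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except (subprocess.TimeoutExpired, FileNotFoundError):
            # A hung program or a missing compiler/runtime counts as a failure.
            return False
        if result.returncode != 0:
            return False
        return expected_stdout is None or result.stdout.strip() == expected_stdout


# Example: validate a (stand-in) model-generated Python file with the local interpreter.
ok = validate_generated_file(
    "solution.py",
    "print(2 + 3)\n",
    [sys.executable, "solution.py"],
    expected_stdout="5",
)
```

For other languages, the same function can be pointed at a different command, such as a compiler invocation followed by a test runner. Which commands the real harness uses is not stated here.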
It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current v2 run passed 3/3.
The important point is not that these small harnesses prove universal coding ability. They prove that the quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.
## Tool-calling validation
Tool calling was exercised in realistic agent loops rather than only checking toy single-call examples. The harness used for this validation was [Swival](https://swival.dev). Nothing in the build is tied to it, and any OpenAI-compatible agent harness is likely to work in much the same way, but Swival is the only one that has actually been put through its paces here.
Validation included:
- a broad synthetic selector suite covering a wide tool surface
- real one-shot agent tasks over files, grep, command execution, fetches, image input, skills, snapshots, todos, and subagents
- a real goal-mode run that required the model to complete work and call a final completion tool
The current v2 results were:
- all-tools selector: 22/22
- real one-shot agent suite: 10/10 with zero failed tool calls
- real goal-mode completion call: passed with exactly one successful final call
A separate repetition-loop guard was also run on long coding and web prompts. The current v2 artifact passed 4/4, with no repeated-tail failures.
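The card does not describe how the repetition-loop guard is implemented. One plausible heuristic, shown here purely as an assumption, is to flag an answer whose ending consists of the same chunk repeated several times in a row. `has_repeated_tail` and its thresholds are hypothetical.

```python
def has_repeated_tail(text: str, min_unit: int = 20, max_unit: int = 400, min_repeats: int = 3) -> bool:
    """Return True if `text` ends with one chunk repeated `min_repeats` times back to back.

    Chunk lengths from `min_unit` to `max_unit` characters are tried; short
    chunks are skipped so ordinary repeated words do not trigger the guard.
    """
    longest = min(max_unit, len(text) // min_repeats)
    for unit in range(min_unit, longest + 1):
        tail = text[-unit:]
        if text.endswith(tail * min_repeats):
            return True
    return False


looping = "Here is the plan. " + "Let me reconsider the approach once more. " * 4
healthy = (
    "The quick brown fox jumps over the lazy dog while the cat "
    "watches from a distant windowsill."
)
```

Here `has_repeated_tail(looping)` is true and `has_repeated_tail(healthy)` is false. A production guard would probably work on tokens rather than characters and tolerate small variations between repeats.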
These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked.
Compared with the earlier local candidate, the v2 build fixed the key practical failures: the selector suite went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real agent task suite completed with zero failed tool calls. This is why the package is labeled `v2`.
## Serving with llama.cpp
Recent llama.cpp builds should be able to load the repo directly:
```sh
llama-server \
-hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 100000 \
--parallel 1 \
--batch-size 512 \
--ubatch-size 128 \
--threads 12 \
--threads-batch 18 \
--prio 0 \
--poll 80 \
--flash-attn on \
--jinja \
--fit on \
--fit-target 4096 \
--fit-ctx 100000 \
--gpu-layers auto \
--cache-type-k f16 \
--cache-type-v f16 \
--reasoning off
```
If you cloned or downloaded the repository locally, you can use the helper script:
```sh
./run-server.sh
```
The helper script loads the first GGUF shard located in the same directory as the script and uses the same default serving profile.
Default settings:
```sh
MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0
```
For more memory headroom, use CPU-MoE mode:
```sh
MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
```
That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.
You can point the script at a specific server binary:
```sh
LLAMA_SERVER=/path/to/llama-server ./run-server.sh
```
## Tool-calling tips
- Disable reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
- Send tool schemas from the client rather than enabling llama.cpp built-in tools.
- Set `parallel_tool_calls` to `false` if your client supports it.
- Avoid forcing `tool_choice: required`; in testing, that made malformed calls more likely.
- Use a client that supports OpenAI-compatible tool calls cleanly.
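The tips above can be sketched as a small standard-library client. This is an illustration under assumptions, not a reference client: the helper names are hypothetical, the `model` value is assumed to be a free-form label for a single-model server, and the base URL simply mirrors the host and port from the serving example.

```python
import json
import urllib.request


def build_tool_request(messages, tools, model="MiMo-V2.5-coder-Q2"):
    """Build a chat-completions body that follows the tips above."""
    return {
        "model": model,                # assumed placeholder label
        "messages": messages,
        "tools": tools,                # schemas are sent by the client
        "parallel_tool_calls": False,  # one tool call at a time
        # "tool_choice" is deliberately left unset rather than forced to "required".
    }


def parse_tool_calls(response):
    """Return (name, arguments) pairs; arguments is None when the JSON is malformed."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        try:
            arguments = json.loads(function["arguments"])
        except (json.JSONDecodeError, TypeError):
            arguments = None
        calls.append((function["name"], arguments))
    return calls


def send(payload, base_url="http://127.0.0.1:8080/v1"):
    """POST the payload to the server's OpenAI-compatible chat endpoint."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Hypothetical tool schema used only for illustration.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

payload = build_tool_request(
    [{"role": "user", "content": "Show me README.md"}],
    [read_file_tool],
)
```

With the server running, `parse_tool_calls(send(payload))` yields the requested calls. Checking for `None` arguments gives the agent loop a cheap way to detect a malformed call and retry instead of executing it.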
## License
The upstream `XiaomiMiMo/MiMo-V2.5` model card declares the MIT license. This derived GGUF is provided with the same license metadata.