---
license: mit
base_model: XiaomiMiMo/MiMo-V2.5
language:
- en
library_name: llama.cpp
tags:
- gguf
- llama.cpp
- text-generation
- code
- coding
- tool-calling
- agent
- mixture-of-experts
- long-context
pipeline_tag: text-generation
---

# MiMo-V2.5 Coder Q2 v2 GGUF

This is a text-only GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling on high-memory local machines.

The target system for this build is a 128 GB Apple Silicon machine. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Smaller-memory machines will likely need a smaller context, more CPU offload, or a smaller quant.

This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. MiMo's multi-token prediction blocks were also omitted because the current llama.cpp MiMo2 generation path does not use those blocks for normal inference.

## Why this build exists

Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. This build was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:

- coding in common systems and scripting languages
- web UI/component generation
- OpenAI-compatible tool calling
- agent loops over real files and commands
- long English technical prompts

Chinese-language quality and multimodal behavior were not optimization targets.

## How it was built

The source was the original `XiaomiMiMo/MiMo-V2.5` checkpoint, converted to GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and omitted runtime-inactive MTP/NextN blocks, so memory is not spent on tensors that current llama.cpp MiMo2 inference does not execute.

The final artifact is a split `Q2_K_S` GGUF with an importance matrix built from English coding, debugging, tool-calling, shell, and agent prompts. The calibration mix was designed to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.

The build was iterative:

1. Convert the original checkpoint to split BF16 GGUF.
2. Produce a first low-bit coding/tool-use candidate.
3. Test that candidate on executable coding tasks and realistic tool-calling agent loops.
4. Add calibration coverage for the failures that showed up in real tests.
5. Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
6. Re-quantize with the final `Q2_K_S` recipe.

The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.

Quantization details:

- Quant type: `Q2_K_S`
- Importance matrix: coding and tool-calling focused
- Embeddings and output tensors kept at higher precision
- Attention and dense first-FFN tensors protected at higher precision
- MoE down-expert tensors kept at `Q3_K`
- Reported size: 108,496.76 MiB (about 106 GiB) at 2.95 bits per weight (BPW)
- Split files: 16 GGUF shards
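
As a rough sanity check on the reported figures, the size and bits-per-weight numbers can be turned into an implied weight count and an average shard size. This is only arithmetic on the values listed above; it assumes the shards are roughly equal in size and ignores GGUF metadata overhead, so treat the results as estimates.

```python
# Figures reported in the quantization details above.
SIZE_MIB = 108_496.76
BITS_PER_WEIGHT = 2.95
SHARDS = 16

size_bytes = SIZE_MIB * 1024 * 1024
size_gib = SIZE_MIB / 1024

# Number of weights implied by total size and average bits per weight.
implied_weights = size_bytes * 8 / BITS_PER_WEIGHT

# Average shard size, assuming roughly equal shards (an assumption).
avg_shard_mib = SIZE_MIB / SHARDS

print(f"size: {size_gib:.2f} GiB")
print(f"implied weights: {implied_weights / 1e9:.1f} billion")
print(f"average shard: {avg_shard_mib:.0f} MiB")
```

The implied count lands a little above 300 billion weights for the text-only tensors that were kept.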

One tokenizer metadata fix is included: the base-vocabulary `</s>` token, which looks like a control token, is marked as one so that llama.cpp does not warn at load time. MiMo's real EOS token remains `<|im_end|>`.

## Why this recipe was chosen

The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.
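
To make the memory envelope concrete, here is a minimal sketch of the headroom left once the weights are loaded. It assumes "128 GB" means 128 GiB of unified memory and treats the `--fit-target` value from the default serving profile as memory to leave free; both readings are assumptions, and the function name is purely illustrative.

```python
MACHINE_GIB = 128          # assumption: "128 GB" treated as 128 GiB
MODEL_MIB = 108_496.76     # reported model size
FIT_TARGET_MIB = 4096      # --fit-target in the default serving profile


def headroom_gib(machine_gib: float, model_mib: float, reserved_mib: float = 0.0) -> float:
    """GiB left after the weights and any explicitly reserved margin."""
    return machine_gib - (model_mib + reserved_mib) / 1024


after_weights = headroom_gib(MACHINE_GIB, MODEL_MIB)
after_margin = headroom_gib(MACHINE_GIB, MODEL_MIB, FIT_TARGET_MIB)

print(f"after weights: {after_weights:.1f} GiB")
print(f"after weights and margin: {after_margin:.1f} GiB")
```

Whatever remains has to cover the KV cache, compute buffers, and the operating system, which is why a higher-bit quant of the same model leaves little usable room.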

The first plain `Q2_K`-family candidate was small enough, but it was not reliable enough for tool calling: it produced malformed arguments in some tool calls and missed several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:

- embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
- attention tensors are protected because tool-call and code prompts are structure-heavy
- the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
- MoE down-expert tensors use `Q3_K`, which was a better quality/memory tradeoff than pushing all expert down-projections lower

That is why this is still a Q2-class build, but not the smallest possible Q2 build.
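
The tiering described above can be sketched as a simple name-based classifier. This is not the actual quantization script: the tensor names follow common llama.cpp GGUF conventions, the tier labels are illustrative, and treating block 0 as the "dense first FFN" is an assumption.

```python
import re


def precision_class(tensor_name: str) -> str:
    """Map a GGUF tensor name to the precision tier described above."""
    if tensor_name.startswith(("token_embd.", "output.")):
        return "higher"  # embeddings and output head
    if re.search(r"\.attn_", tensor_name):
        return "higher"  # attention tensors
    if re.search(r"\.ffn_down_exps\.", tensor_name):
        return "Q3_K"    # MoE down-expert tensors
    if re.match(r"blk\.0\.ffn_(gate|up|down)\.", tensor_name):
        return "higher"  # dense first FFN (assumed to live in block 0)
    return "base"        # everything else uses the base Q2_K_S types


for name in ("token_embd.weight", "blk.12.ffn_down_exps.weight", "blk.12.ffn_up_exps.weight"):
    print(name, "->", precision_class(name))
```

The "higher" label is deliberately unspecific, since the exact types used for the protected tensors are not listed here.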

## Why it is good at coding

This quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.

The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle JavaScript Markdown parsing, incorrect Zig checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.

The final v2 artifact passed the local coding and web-design harness across:

- Swift
- JavaScript
- TypeScript through Deno
- Rust
- C
- C++
- Zig
- Python
- Perl
- Go
- static HTML/CSS

That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.
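
The general shape of such a check can be sketched in a few lines. This is not the harness used for the results above; the function name, file name, and stand-in source are hypothetical, and the Python interpreter plays the role of the "local runtime".

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_file(source: str, filename: str, command: list[str],
                       timeout: float = 30.0) -> bool:
    """Write model output into a fresh directory and run a validator command there."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, filename).write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(command, cwd=workdir, capture_output=True,
                                    text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


# Stand-in for a model-generated file; the assert acts as an edge-case check.
generated = "def add(a, b):\n    return a + b\n\nassert add(2, -2) == 0\n"
passed = run_generated_file(generated, "solution.py", [sys.executable, "solution.py"])
print("passed" if passed else "failed")
```

Swapping the command for a compiler or test runner invocation gives the same pass/fail signal for other languages.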

It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current v2 run passed 3/3.

The important point is not that these small harnesses prove universal coding ability. They prove that the quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.

## Tool-calling validation

Tool calling was exercised in realistic agent loops rather than only checking toy single-call examples. The harness used for this validation was [Swival](https://swival.dev). Nothing in the build is tied to it, and any OpenAI-compatible agent harness is likely to work in much the same way, but Swival is the only one that has actually been put through its paces here.

Validation included:

- a broad synthetic selector suite covering a wide tool surface
- real one-shot agent tasks over files, grep, command execution, fetches, image input, skills, snapshots, todos, and subagents
- a real goal-mode run that required the model to complete work and call a final completion tool

The current v2 results were:

- all-tools selector: 22/22
- real one-shot agent suite: 10/10 with zero failed tool calls
- real goal-mode completion call: passed with exactly one successful final call
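
For readers who want to count "failed tool calls" in their own setup, a minimal sketch of a per-call check follows. It assumes the OpenAI-style tool-call shape (a `function` object with `name` and a JSON-encoded `arguments` string) and only checks required keys; the function name is illustrative and this is not the validation code used for the numbers above.

```python
import json


def check_tool_call(call: dict, tools: list[dict]) -> list[str]:
    """Return a list of problems with one OpenAI-style tool call; empty means OK."""
    function = call.get("function", {})
    schemas = {t["function"]["name"]: t["function"].get("parameters", {}) for t in tools}
    name = function.get("name")
    if name not in schemas:
        return [f"unknown tool: {name!r}"]
    try:
        args = json.loads(function.get("arguments") or "")
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    # Only required keys are checked; a full JSON Schema validator would do more.
    return [f"missing required argument: {key}"
            for key in schemas[name].get("required", []) if key not in args]
```

A call counts as malformed as soon as the returned list is non-empty.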

A separate repetition-loop guard was also run on long coding and web prompts. The current v2 artifact passed 4/4, with no repeated-tail failures.
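
The guard itself is not described in detail here. One hypothetical way to detect a repeated tail is to check whether the output ends with the same chunk several times in a row; the function name and thresholds below are illustrative assumptions, not the values used in testing.

```python
def has_repeated_tail(text: str, min_unit: int = 20, max_unit: int = 400,
                      min_repeats: int = 3) -> bool:
    """True if the text ends with the same chunk repeated back to back."""
    n = len(text)
    longest = min(max_unit, n // min_repeats)
    for unit in range(min_unit, longest + 1):
        tail = text[n - unit:]
        if text.endswith(tail * min_repeats):
            return True
    return False


looping = "Intro. " + "Let me reconsider the approach. " * 5
normal = "The function parses headings, lists, and inline code, then returns HTML."
print(has_repeated_tail(looping), has_repeated_tail(normal))
```

The minimum chunk length keeps short, legitimate repetition (such as closing brackets) from being flagged.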

These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked.

Compared with the earlier local candidate, the v2 build fixed the key practical failures: the selector suite went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real agent task suite completed with zero failed tool calls. This is why the package is labeled `v2`.

## Serving with llama.cpp

Recent llama.cpp builds should be able to load the repo directly:

```sh
llama-server \
  -hf jedisct1/MiMo-V2.5-coder-Q2 \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off
```

If you cloned or downloaded the repository locally, you can use the helper script:

```sh
./run-server.sh
```

The helper script loads the first GGUF shard from its own directory and uses the same default serving profile.

Default settings:

```sh
MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0
```

For more memory headroom, use CPU-MoE mode:

```sh
MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
```

That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.

You can point the script at a specific server binary:

```sh
LLAMA_SERVER=/path/to/llama-server ./run-server.sh
```
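
The contents of `run-server.sh` are not reproduced here, so the following is only a guess at how the environment variables above might map onto `llama-server` flags. The mapping, the function name, and the use of `--cpu-moe` for `MIMO_CPU_MOE=1` are assumptions; the model argument is left out because the way the script locates the first shard is not shown.

```python
import os

# Defaults as listed under "Default settings".
DEFAULTS = {
    "MIMO_CTX": "100000",
    "MIMO_FIT_CTX": "100000",
    "MIMO_FIT_TARGET": "4096",
    "MIMO_BATCH": "512",
    "MIMO_UBATCH": "128",
    "MIMO_REASONING": "off",
    "MIMO_CPU_MOE": "0",
}


def server_args(env=None) -> list[str]:
    """Build a llama-server argument list from MIMO_* variables (assumed mapping)."""
    env = os.environ if env is None else env
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    args = [
        env.get("LLAMA_SERVER", "llama-server"),
        "--ctx-size", cfg["MIMO_CTX"],
        "--fit", "on",
        "--fit-ctx", cfg["MIMO_FIT_CTX"],
        "--fit-target", cfg["MIMO_FIT_TARGET"],
        "--batch-size", cfg["MIMO_BATCH"],
        "--ubatch-size", cfg["MIMO_UBATCH"],
        "--reasoning", cfg["MIMO_REASONING"],
    ]
    if cfg["MIMO_CPU_MOE"] == "1":
        args.append("--cpu-moe")  # assumption: keeps MoE expert weights on the CPU
    return args


print(" ".join(server_args({"MIMO_CPU_MOE": "1", "MIMO_BATCH": "128"})))
```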

## Tool-calling tips

- Disable reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
- Send tool schemas from the client rather than enabling llama.cpp built-in tools.
- Set `parallel_tool_calls` to `false` if your client supports it.
- Avoid forcing `tool_choice: required`; in testing, that made malformed calls more likely.
- Use a client that supports OpenAI-compatible tool calls cleanly.
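
A minimal sketch of a request that follows these tips, using only the Python standard library. The `list_files` tool and the `model` value are hypothetical placeholders, and the URL assumes the host and port from the serving profile above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # host and port from the serving profile

# Hypothetical tool schema, sent from the client side.
LIST_FILES_TOOL = {
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List files in a directory.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_chat_request(prompt: str, tools: list[dict]) -> urllib.request.Request:
    """Build a chat completion request with client-side tools and no forced tool use."""
    payload = {
        "model": "local",                 # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",            # not "required"
        "parallel_tool_calls": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("List the files in src/", [LIST_FILES_TOOL])
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
```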

## License

The upstream `XiaomiMiMo/MiMo-V2.5` model card declares the MIT license. This derived GGUF is provided with the same license metadata.