---
license: mit
base_model: XiaomiMiMo/MiMo-V2.5
language:
- en
library_name: llama.cpp
tags:
- gguf
- llama.cpp
- text-generation
- code
- coding
- tool-calling
- agent
- mixture-of-experts
- long-context
pipeline_tag: text-generation
---
# MiMo-V2.5 Coder Q2 v2 GGUF

This is a text-only GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling on high-memory local machines.

The target system for this build is a 128 GB Apple Silicon machine. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Smaller-memory machines will likely need a smaller context, more CPU offload, or a smaller quant.

This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. MiMo's multi-token prediction blocks were also omitted because the current llama.cpp MiMo2 generation path does not use those blocks for normal inference.
## Why this build exists

Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. This build was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:

- coding in common systems and scripting languages
- web UI/component generation
- OpenAI-compatible tool calling
- agent loops over real files and commands
- long English technical prompts

Chinese-language quality and multimodal behavior were not optimization targets.
## How it was built

The source was the original `XiaomiMiMo/MiMo-V2.5` checkpoint, converted to GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and omitted runtime-inactive MTP/NextN blocks, so memory is not spent on tensors that current llama.cpp MiMo2 inference does not execute.

The final artifact is a split `Q2_K_S` GGUF with an importance matrix built from English coding, debugging, tool-calling, shell, and agent prompts. The calibration mix was designed to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.

The build was iterative:

1. Convert the original checkpoint to split BF16 GGUF.
2. Produce a first low-bit coding/tool-use candidate.
3. Test that candidate on executable coding tasks and realistic tool-calling agent loops.
4. Add calibration coverage for the failures that showed up in real tests.
5. Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
6. Re-quantize with the final `Q2_K_S` recipe.

The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.

Quantization details:

- Quant type: `Q2_K_S`
- Importance matrix: coding and tool-calling focused
- Embeddings and output tensors kept at higher precision
- Attention and dense first-FFN tensors protected at higher precision
- MoE down-expert tensors kept at `Q3_K`
- Reported size: about 108,496.76 MiB, 2.95 BPW
- Split files: 16 GGUF shards

One tokenizer metadata fix is included: the base-vocabulary `</s>` token is marked as a control-type token, so llama.cpp does not warn about a control-looking token at load time. MiMo's real EOS token remains `<|im_end|>`.
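
The build steps and quantization details above map roughly onto three llama.cpp tools: `convert_hf_to_gguf.py`, `llama-imatrix`, and `llama-quantize`. The sketch below only assembles and prints plausible command lines; it is not the exact recipe used for this build. Every path is a hypothetical placeholder, `q6_k` is an assumed stand-in for "higher precision" (the card does not name the type), and the shard splitting, the text-only conversion options, and the attention and first-FFN overrides are left out.

```python
import shlex
from pathlib import Path

# Hypothetical paths; adjust to your own layout.
SRC_DIR = Path("MiMo-V2.5")            # local copy of the original checkpoint
BF16 = Path("mimo-bf16.gguf")          # intermediate BF16 GGUF
CALIB = Path("calibration.txt")        # coding/tool-use prompt mix
IMATRIX = Path("imatrix.dat")          # importance matrix output
OUT = Path("mimo-coder-q2.gguf")       # final quantized model
HIGH = "q6_k"                          # ASSUMPTION: the card only says "higher precision"


def build_commands() -> list[list[str]]:
    """Return the convert, imatrix, and quantize command lines in order."""
    convert = ["python", "convert_hf_to_gguf.py", str(SRC_DIR),
               "--outtype", "bf16", "--outfile", str(BF16)]
    imatrix = ["llama-imatrix", "-m", str(BF16), "-f", str(CALIB), "-o", str(IMATRIX)]
    quantize = ["llama-quantize", "--imatrix", str(IMATRIX),
                "--token-embedding-type", HIGH,          # embeddings at higher precision
                "--output-tensor-type", HIGH,            # output tensor at higher precision
                "--tensor-type", "ffn_down_exps=q3_k",   # MoE down-experts at Q3_K
                str(BF16), str(OUT), "Q2_K_S"]
    return [convert, imatrix, quantize]


for cmd in build_commands():
    print(shlex.join(cmd))
```

Steps 3 to 5 of the list correspond to editing the calibration file and rerunning the last two commands.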
## Why this recipe was chosen

The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.

The first plain `Q2_K` family candidate was small enough, but it was not reliable enough for tool calling. It malformed some tool-call arguments and missed several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:

- embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
- attention tensors are protected because tool-call and code prompts are structure-heavy
- the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
- MoE down-expert tensors use `Q3_K`, which was a better quality/memory tradeoff than pushing all expert down-projections lower

That is why this is still a Q2-class build, but not the smallest possible Q2 build.
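
To make the memory constraint concrete, the reported size and bits-per-weight figures can be turned into a rough budget. This is a back-of-the-envelope sketch, not a measurement: it assumes "128 GB" means 128 GiB, treats the 2.95 BPW figure as a uniform average over all weights, and uses 4.5 BPW purely as an illustrative higher-bit point rather than a specific quant type.

```python
MIB_PER_GIB = 1024

MODEL_MIB = 108_496.76   # reported GGUF size from the quantization details
BPW = 2.95               # reported average bits per weight
MACHINE_GIB = 128        # ASSUMPTION: "128 GB" read as 128 GiB


def implied_weight_count(size_mib: float, bpw: float) -> float:
    """Number of weights implied by a file size and an average bits-per-weight."""
    return size_mib * 1024 * 1024 * 8 / bpw


def size_at_bpw(size_mib: float, bpw: float, new_bpw: float) -> float:
    """Size in MiB of the same weights stored at a different average bits-per-weight."""
    return size_mib * new_bpw / bpw


def headroom_mib(machine_gib: float, model_mib: float) -> float:
    """Memory left for KV cache, runtime buffers, and the OS once the weights are loaded."""
    return machine_gib * MIB_PER_GIB - model_mib


print(f"implied weights: {implied_weight_count(MODEL_MIB, BPW) / 1e9:.1f} billion")
print(f"headroom at 2.95 BPW: {headroom_mib(MACHINE_GIB, MODEL_MIB):.0f} MiB")
print(f"size at 4.5 BPW: {size_at_bpw(MODEL_MIB, BPW, 4.5):.0f} MiB")
```

Under these assumptions the weights leave roughly 22 GiB for everything else, while the illustrative 4.5 BPW variant would not fit in 128 GiB even before any KV cache is allocated.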
## Why it is good at coding

This quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.

The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle JavaScript Markdown parsing, incorrect Zig checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.

The final v2 artifact passed the local coding and web-design harness across:

- Swift
- JavaScript
- TypeScript through Deno
- Rust
- C
- C++
- Zig
- Python
- Perl
- Go
- static HTML/CSS

That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.

It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current v2 run passed 3/3.

The important point is not that these small harnesses prove universal coding ability. They prove that the quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.
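
The harness idea described above (write a complete generated file into an isolated directory, then run it with a local runtime) can be sketched in a few lines. This is a minimal illustration for the Python case only, not the harness used for this build; the function name, the default file name, and the timeout are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_file(source: str, filename: str = "solution.py", timeout: float = 30.0) -> bool:
    """Write model output to a fresh temporary directory, run it, and report success."""
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / filename
        path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],  # local runtime; swap in a compiler or test runner
                cwd=workdir,                  # keep any files the program writes inside the sandbox dir
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False                      # treat hangs as failures
        return result.returncode == 0


# A generated file that carries its own edge-case checks passes only if they all hold.
print(run_generated_file("assert sorted([3, 1, 2]) == [1, 2, 3]\n"))
```

A temporary directory isolates files but is not a security boundary; untrusted model output still deserves a proper sandbox.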
## Tool-calling validation

Tool calling was exercised in realistic agent loops rather than only checking toy single-call examples. The harness used for this validation was [Swival](https://swival.dev). Nothing in the build is tied to it, and any OpenAI-compatible agent harness is likely to work in much the same way, but Swival is the only one that has actually been put through its paces here.

Validation included:

- a broad synthetic selector suite covering a wide tool surface
- real one-shot agent tasks over files, grep, command execution, fetches, image input, skills, snapshots, todos, and subagents
- a real goal-mode run that required the model to complete work and call a final completion tool

The current v2 results were:

- all-tools selector: 22/22
- real one-shot agent suite: 10/10 with zero failed tool calls
- real goal-mode completion call: passed with exactly one successful final call

A separate repetition-loop guard was also run on long coding and web prompts. The current v2 artifact passed 4/4, with no repeated-tail failures.

These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked.

Compared with the earlier local candidate, the v2 build fixed the key practical failures: the selector suite went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real agent task suite completed with zero failed tool calls. This is why the package is labeled `v2`.
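
A repeated-tail check of the kind mentioned above can be approximated by testing whether the end of a response is the same chunk repeated back to back. The sketch below is one simple way to do that and is not the guard used for this validation; the function name and the thresholds (`min_unit`, `min_repeats`, `window`) are hypothetical.

```python
def has_repeated_tail(text: str, min_unit: int = 20, min_repeats: int = 3, window: int = 4000) -> bool:
    """Return True if the text ends with the same chunk repeated at least min_repeats times."""
    text = text[-window:]              # only the tail matters, and this bounds the cost
    n = len(text)
    max_unit = n // min_repeats        # longest chunk that could still repeat enough times
    for unit in range(min_unit, max_unit + 1):
        chunk = text[n - unit:]
        # Compare the last min_repeats consecutive slices of length `unit` with the final one.
        if all(text[n - (k + 1) * unit: n - k * unit] == chunk for k in range(min_repeats)):
            return True
    return False


looping = "Intro. " + "Let me reconsider the parser design. " * 4
print(has_repeated_tail(looping))
print(has_repeated_tail("The function parses headings, lists, and code fences, then renders HTML."))
```

Exact matching catches only verbatim loops; near-duplicate loops with small wording changes would need a fuzzier comparison.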
## Serving with llama.cpp

Recent llama.cpp builds should be able to load the repo directly:

```sh
llama-server \
  -hf jedisct1/MiMo-V2.5-coder-Q2 \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 100000 \
  --parallel 1 \
  --batch-size 512 \
  --ubatch-size 128 \
  --threads 12 \
  --threads-batch 18 \
  --prio 0 \
  --poll 80 \
  --flash-attn on \
  --jinja \
  --fit on \
  --fit-target 4096 \
  --fit-ctx 100000 \
  --gpu-layers auto \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --reasoning off
```

If you cloned or downloaded the repository locally, you can use the helper script:

```sh
./run-server.sh
```

The helper script loads the first GGUF shard next to it and uses the same default serving profile.

Default settings:

```sh
MIMO_CTX=100000
MIMO_FIT_CTX=100000
MIMO_FIT_TARGET=4096
MIMO_BATCH=512
MIMO_UBATCH=128
MIMO_REASONING=off
MIMO_CPU_MOE=0
```

For more memory headroom, use CPU-MoE mode:

```sh
MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
```

That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.

You can point the script at a specific server binary:

```sh
LLAMA_SERVER=/path/to/llama-server ./run-server.sh
```
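
Once the server is running, any OpenAI-compatible client can talk to it. As a minimal sketch using only the Python standard library, the helper below builds a chat-completions request against the host and port from the profile above. The function names, the `max_tokens` default, the timeout, and the `model` label are illustrative assumptions, and `chat()` only works while the server is up.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"   # matches --host/--port in the serving profile


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "jedisct1/MiMo-V2.5-coder-Q2",   # label only; adjust to your setup
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant text. Requires a running server."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server started, `print(chat("Write a Zig function that adds two u32 values with overflow checking."))` would exercise the default profile end to end.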
## Tool-calling tips

- Disable reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
- Send tool schemas from the client rather than enabling llama.cpp built-in tools.
- Set `parallel_tool_calls` to `false` if your client supports it.
- Avoid forcing `tool_choice: required`; in testing, that made malformed calls more likely.
- Use a client that supports OpenAI-compatible tool calls cleanly.
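
The tips above translate into a request payload along the following lines, plus a defensive check on whatever tool call comes back. This is a sketch under stated assumptions: `read_file` is a hypothetical tool invented for illustration, the helper names are not part of any library, and the validation covers only JSON well-formedness and required keys rather than full JSON Schema.

```python
import json

# Hypothetical tool schema, used only for illustration.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_tool_payload(messages: list, tools: list) -> dict:
    """Chat payload that sends schemas from the client and follows the tips above."""
    return {
        "model": "jedisct1/MiMo-V2.5-coder-Q2",  # label only; adjust to your setup
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",           # not "required"
        "parallel_tool_calls": False,    # one call at a time
    }


def parse_tool_call(tool_call: dict, tools: list) -> tuple:
    """Return (name, arguments) or raise ValueError if the call is malformed."""
    fn = tool_call.get("function", {})
    name = fn.get("name")
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in schemas:
        raise ValueError(f"unknown tool: {name!r}")
    try:
        args = json.loads(fn.get("arguments") or "{}")
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = [key for key in schemas[name].get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return name, args
```

Rejecting a malformed call and asking the model to retry is usually cheaper than letting a bad argument reach a real file or command.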
## License

The upstream `XiaomiMiMo/MiMo-V2.5` model card declares the MIT license. This derived GGUF is provided with the same license metadata.