jedisct1 committed on
Commit 56a9fa5 · verified · 1 Parent(s): f7e8204

Add files using upload-large-folder tool

Files changed (1)
  1. README.md +4 -47
README.md CHANGED
@@ -22,24 +22,10 @@ This is a local, self-quantized GGUF build of [XiaomiMiMo/MiMo-V2.5](https://hug

 This quant was optimized for systems with 128 GB of memory. The default serving profile targets a 128 GB Apple Silicon machine and tries to keep the model practical at a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.

-It is a text-only llama.cpp conversion of the MiMo language backbone. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.
-
-## Files
-
-The model is split into 16 GGUF shards:
-
-```text
-MiMo-V2.5-coder-Q2-00001-of-00016.gguf
-...
-MiMo-V2.5-coder-Q2-00016-of-00016.gguf
-```
-
-Load the first shard. llama.cpp will find the remaining shards automatically.
+It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.

 ## Quantization

-This artifact was quantized from the original XiaomiMiMo checkpoint, not from a third-party GGUF.
-
 High-level summary:

 - Quant type: `Q2_K_S`
@@ -50,17 +36,9 @@ High-level summary:

 One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab `</s>` token at load time. MiMo's actual EOS token remains `<|im_end|>`.

-## Importance Matrix
-
-The importance matrix is what makes this quant more targeted than a generic low-bit conversion.
-
-It was built from English coding and agent-style prompts: reading files, searching code, running shell commands, editing workflows, short code review tasks, and OpenAI-compatible tool calls. The tool-call calibration included realistic argument shapes such as bounded file reads, tail reads, grep context lines, and command arrays.
-
-That means the quantization tries to spend its limited precision budget on weights that matter for coding and structured tool use. It is not a general-purpose multilingual calibration set, and it was not designed to preserve Chinese or multimodal quality.
+## Imatrix

-## Why This Quant Exists
-
-The goal is not to preserve every capability of the original model equally. This build deliberately prioritizes:
+This build deliberately prioritizes:

 - reliable OpenAI-compatible tool calls
 - coding and shell-oriented agent use
@@ -71,8 +49,6 @@ Chinese-language quality and multimodal use were not optimization targets.

 ## Serving

-Install or build llama.cpp with `llama-server` available in `PATH`. Once the model is on Hugging Face, the usual way to run it is directly from the Hub:
-
 ```sh
 llama-server \
 -hf jedisct1/MiMo-V2.5-coder-Q2 \
@@ -193,32 +169,13 @@ llama-server \

 For best tool-calling results:

-- Use OpenAI-compatible request-provided tool schemas.
-- Keep llama.cpp built-in tools disabled unless you are specifically testing them.
+- Use the [Swival](https://swival.dev) harness
 - Disable model reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
 - Set `parallel_tool_calls` to `false` if your client supports it.
 - Avoid forcing `tool_choice: required`; in testing it made malformed calls more likely.

 This build was tested with Swival-style tool schemas for `read_file`, `grep`, `outline`, `run_command`, and `todo`.

-## Local Test Results
-
-On the local 128 GB Apple Silicon M5 setup:
-
-- The first local `Q2_K` artifact passed only 5/7 in its best Swival-shaped serving configuration.
-- This imatrix-backed `Q2_K_S` artifact passed 21/21 across three Swival-shaped repeat runs with fast M5 serving defaults.
-- A Swival-style `Hello` request rendered to 4,411 prompt tokens because the client included a system prompt and tool schemas. With fast M5 serving, llama.cpp processed that payload at about 239 prompt tokens/sec. With all MoE tensors on CPU, the same class of prompt processed at about 13 prompt tokens/sec.
-
-These are local smoke and agent-harness results, not a public benchmark suite.
-
-## Limitations
-
-- Text-only GGUF: no vision, video, or audio encoders.
-- MTP blocks are omitted.
-- The quantization is very low bit. It is intended to fit and run locally, not to match the full BF16 checkpoint.
-- The default 100,000-token context is much smaller than MiMo-V2.5's advertised 1M training context, but much more practical on this hardware.
-- Quality should be validated on your own coding and tool-calling workloads before relying on it.
-
 ## License

 The upstream model card for `XiaomiMiMo/MiMo-V2.5` declares the MIT license. This derived GGUF is provided under the same license metadata.
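
For readers who want to apply the README's tool-calling tips by hand rather than through a harness, the following is a minimal sketch of an OpenAI-compatible chat-completions payload that sets `parallel_tool_calls` to `false` and leaves `tool_choice` at `auto` instead of `required`. It only builds and prints the JSON body and does not contact a server. The `read_file` parameter names (`path`, `offset`, `limit`), the `build_request` helper, and the use of the repository name as the `model` value are assumptions made for illustration; the actual Swival schemas are not shown in this commit.

```python
import json


def build_request(prompt: str) -> dict:
    """Build a chat-completions payload that follows the README's tool-calling tips."""
    # Hypothetical schema: the real Swival `read_file` parameters are not
    # shown in the source, so `path`, `offset`, and `limit` are placeholders.
    read_file_tool = {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a bounded range of lines from a file.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "offset": {"type": "integer", "minimum": 0},
                    "limit": {"type": "integer", "minimum": 1},
                },
                "required": ["path"],
            },
        },
    }
    return {
        # Assumption: the repository name is reused as the model identifier.
        "model": "jedisct1/MiMo-V2.5-coder-Q2",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [read_file_tool],
        # "auto" rather than "required": the README reports that forcing
        # tool calls made malformed calls more likely.
        "tool_choice": "auto",
        # One tool call per turn, as the README recommends.
        "parallel_tool_calls": False,
    }


if __name__ == "__main__":
    print(json.dumps(build_request("Show the first 20 lines of README.md"), indent=2))
```

The printed body could then be posted to the server's chat-completions endpoint with any HTTP client; the host and port depend on how `llama-server` was started.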