jedisct1 committed on
Commit c3ebe02 · verified · 1 Parent(s): eea7d9c

Add files using upload-large-folder tool

Files changed (2)
  1. .swival/repl_history +9 -0
  2. README.md +41 -13
.swival/repl_history ADDED
@@ -0,0 +1,9 @@
1
+
2
+ # 2026-05-25 12:10:06.766878
3
+ +hello
4
+
5
+ # 2026-05-25 12:11:32.119684
6
+ +List my files
7
+
8
+ # 2026-05-25 12:20:09.313414
9
+ +/new
README.md CHANGED
@@ -13,16 +13,21 @@ tags:
13
  - agent
14
  - mixture-of-experts
15
  - long-context
 
16
  pipeline_tag: text-generation
17
  ---
18
 
19
- # MiMo-V2.5 Coder Q2 GGUF
20
 
21
- This is a local, self-quantized GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and tool-calling on a 128 GB Apple Silicon M5 machine.
22
 
23
- This quant was optimized for systems with 128 GB of memory. The default serving profile targets a 128 GB Apple Silicon machine and tries to keep the model practical at a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.
24
 
25
- It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders. The MiMo multi-token prediction blocks were also omitted during conversion because normal llama.cpp generation does not currently execute them for this model.
 
 
 
 
26
 
27
  ## Quantization
28
 
@@ -30,9 +35,10 @@ High-level summary:
30
 
31
  - Quant type: `Q2_K_S`
32
  - Importance matrix: coding and tool-calling focused
33
- - Preserved higher precision for embeddings, output, attention, and the dense first FFN
34
  - MoE down-expert tensors: `Q3_K`
35
- - Reported quantized size: about 108,496.76 MiB at 2.95 BPW
 
36
 
37
  One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab `</s>` token at load time. MiMo's actual EOS token remains `<|im_end|>`.
38
 
@@ -45,13 +51,17 @@ This build deliberately prioritizes:
45
  - English prompts and codebase work
46
  - practical inference on a 128 GB Apple Silicon system
47
 
 
 
48
  Chinese-language quality and multimodal use were not optimization targets.
49
 
50
  ## Serving
51
 
 
 
52
  ```sh
53
  llama-server \
54
- -hf jedisct1/MiMo-V2.5-coder-Q2 \
55
  --host 127.0.0.1 \
56
  --port 8080 \
57
  --ctx-size 100000 \
@@ -70,7 +80,9 @@ llama-server \
70
  --gpu-layers auto \
71
  --cache-type-k f16 \
72
  --cache-type-v f16 \
73
- --reasoning off
 
 
74
  ```
75
 
76
  This starts an OpenAI-compatible server on `127.0.0.1:8080`. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.
@@ -93,9 +105,11 @@ MIMO_BATCH=512
93
  MIMO_UBATCH=128
94
  MIMO_REASONING=off
95
  MIMO_CPU_MOE=0
 
 
96
  ```
97
 
98
- These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model's Jinja chat template, use Flash Attention, and ask llama.cpp to fit as much of the model as possible onto Metal.
99
 
100
  If you hit memory pressure, use the safer CPU-MoE mode:
101
 
@@ -115,7 +129,7 @@ You can also run `llama-server` directly against local files without the helper
115
 
116
  ```sh
117
  llama-server \
118
- --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
119
  --host 127.0.0.1 \
120
  --port 8080 \
121
  --ctx-size 100000 \
@@ -134,14 +148,16 @@ llama-server \
134
  --gpu-layers auto \
135
  --cache-type-k f16 \
136
  --cache-type-v f16 \
137
- --reasoning off
 
 
138
  ```
139
 
140
  For the safer CPU-MoE fallback, add `--cpu-moe` and use a larger fit margin:
141
 
142
  ```sh
143
  llama-server \
144
- --model MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
145
  --ctx-size 100000 \
146
  --fit on \
147
  --fit-target 32768 \
@@ -154,17 +170,29 @@ llama-server \
154
  --cache-type-k f16 \
155
  --cache-type-v f16 \
156
  --reasoning off \
 
 
157
  --cpu-moe
158
  ```
159
 
 
 
 
 
 
 
 
 
 
 
160
  ## Tool-Calling Notes
161
 
162
  For best tool-calling results:
163
 
164
- - Use the [Swival](https://swival.dev) harness - it should work with anything using OpenAI-like tool calling convention, but it is tested with Swival.
165
  - Disable model reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
166
  - Set `parallel_tool_calls` to `false` if your client supports it.
167
  - Avoid forcing `tool_choice: required`; in testing it made malformed calls more likely.
 
168
 
169
  ## License
170
 
 
13
  - agent
14
  - mixture-of-experts
15
  - long-context
16
+ - mtp
17
  pipeline_tag: text-generation
18
  ---
19
 
20
+ # MiMo-V2.5 Coder Q2 MTP GGUF
21
 
22
+ *Work in progress. Please use the non-MTP version for now.*
23
 
24
+ This is the MTP-included sibling of `MiMo-V2.5-coder-Q2`: a local, self-quantized GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling.
25
 
26
+ This quant was optimized for systems with 128 GB of memory and a 100,000-token context. Smaller-memory systems will likely need more aggressive CPU offload, a smaller context, or a different quant.
27
+
28
+ It is a text-only quantization. The original MiMo-V2.5 checkpoint is omnimodal, but this GGUF does not include the vision or audio encoders.
29
+
30
+ This variant includes MiMo's three multi-token prediction blocks and is meant for llama.cpp builds with `draft-mtp` speculative decoding support. If you run it as a plain non-speculative model, llama.cpp may report the trailing MTP tensors as unused; that is expected when speculative MTP is disabled.
31
 
32
  ## Quantization
33
 
 
35
 
36
  - Quant type: `Q2_K_S`
37
  - Importance matrix: coding and tool-calling focused
38
+ - Preserved higher precision for embeddings, output, attention, dense first FFN, and MTP dense/projection tensors
39
  - MoE down-expert tensors: `Q3_K`
40
+ - Reported quantized size: about 109,026.87 MiB at 2.95 BPW
41
+ - MTP metadata: `mimo2.block_count = 51`, `mimo2.nextn_predict_layers = 3`
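
The size figures in the old and new READMEs give a rough idea of what the MTP blocks cost on disk. The sketch below only subtracts the two reported sizes. The function names are made up for this example, and it assumes the whole difference comes from the added MTP tensors, which the README does not state explicitly.

```sh
#!/bin/sh
# Illustrative arithmetic only; function names are hypothetical.

# Size difference between two quantized builds, in MiB (two decimals).
size_delta_mib() {
  awk -v with_mtp="$1" -v without_mtp="$2" \
    'BEGIN { printf "%.2f\n", with_mtp - without_mtp }'
}

# The same difference as a percentage of the smaller build.
size_delta_percent() {
  awk -v with_mtp="$1" -v without_mtp="$2" \
    'BEGIN { printf "%.2f\n", 100 * (with_mtp - without_mtp) / without_mtp }'
}

# Reported sizes in MiB: MTP build first, non-MTP build second.
size_delta_mib 109026.87 108496.76
size_delta_percent 109026.87 108496.76
```

With the reported numbers this works out to roughly 530 MiB, or about half a percent of the non-MTP build, which is consistent with both builds rounding to 2.95 BPW.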
42
 
43
  One tokenizer metadata fix is included so llama.cpp does not warn about the base-vocab `</s>` token at load time. MiMo's actual EOS token remains `<|im_end|>`.
44
 
 
51
  - English prompts and codebase work
52
  - practical inference on a 128 GB Apple Silicon system
53
 
54
+ The importance matrix was built from an expanded English calibration set with coding, review, shell, and Swival-style tool-use prompts. It used the first runnable local Q2 GGUF as the calibration model and focused on the main text generation path. The MTP tensors were not present in that first-pass calibration matrix, so they were protected manually at `Q4_K` where it mattered most.
55
+
56
  Chinese-language quality and multimodal use were not optimization targets.
57
 
58
  ## Serving
59
 
60
+ Most users should start it directly from Hugging Face with llama.cpp:
61
+
62
  ```sh
63
  llama-server \
64
+ -hf jedisct1/MiMo-V2.5-coder-Q2-MTP \
65
  --host 127.0.0.1 \
66
  --port 8080 \
67
  --ctx-size 100000 \
 
80
  --gpu-layers auto \
81
  --cache-type-k f16 \
82
  --cache-type-v f16 \
83
+ --reasoning off \
84
+ --spec-type draft-mtp \
85
+ --spec-draft-n-max 3
86
  ```
87
 
88
  This starts an OpenAI-compatible server on `127.0.0.1:8080`. The repository contains one GGUF split set, so recent llama.cpp builds should select the first shard automatically.
 
105
  MIMO_UBATCH=128
106
  MIMO_REASONING=off
107
  MIMO_CPU_MOE=0
108
+ MIMO_SPEC_TYPE=draft-mtp
109
+ MIMO_SPEC_DRAFT_N_MAX=3
110
  ```
111
 
112
+ These defaults are tuned for an Apple M5 Max with 128 GB unified memory. They keep reasoning output disabled, use the model's Jinja chat template, use Flash Attention, enable llama.cpp's MTP speculative decoding, and ask llama.cpp to fit as much of the model as possible onto Metal.
113
 
114
  If you hit memory pressure, use the safer CPU-MoE mode:
115
 
 
129
 
130
  ```sh
131
  llama-server \
132
+ --model MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf \
133
  --host 127.0.0.1 \
134
  --port 8080 \
135
  --ctx-size 100000 \
 
148
  --gpu-layers auto \
149
  --cache-type-k f16 \
150
  --cache-type-v f16 \
151
+ --reasoning off \
152
+ --spec-type draft-mtp \
153
+ --spec-draft-n-max 3
154
  ```
155
 
156
  For the safer CPU-MoE fallback, add `--cpu-moe` and use a larger fit margin:
157
 
158
  ```sh
159
  llama-server \
160
+ --model MiMo-V2.5-coder-Q2-MTP-00001-of-00016.gguf \
161
  --ctx-size 100000 \
162
  --fit on \
163
  --fit-target 32768 \
 
170
  --cache-type-k f16 \
171
  --cache-type-v f16 \
172
  --reasoning off \
173
+ --spec-type draft-mtp \
174
+ --spec-draft-n-max 3 \
175
  --cpu-moe
176
  ```
177
 
178
+ ## MTP Runtime Note
179
+
180
+ This GGUF keeps the MTP tensors, and the serving examples enable llama.cpp's `draft-mtp` speculative decoder. Plain generation without `--spec-type draft-mtp` can show warnings like `model has unused tensor blk.48...` because the MTP blocks are not part of the normal trunk pass. That warning is expected for non-speculative loads and does not indicate a corrupted file.
181
+
182
+ To disable speculative decoding for troubleshooting:
183
+
184
+ ```sh
185
+ MIMO_SPEC_TYPE=none ./run-server.sh
186
+ ```
187
+
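The helper script itself is not part of this diff, so the following is only a guess at how a wrapper could turn `MIMO_SPEC_TYPE` and `MIMO_SPEC_DRAFT_N_MAX` into the `llama-server` flags used in the serving examples. The `spec_flags` function is hypothetical and is not taken from `run-server.sh`.

```sh
#!/bin/sh
# Hypothetical sketch, NOT the repository's actual run-server.sh.
# Maps the documented environment variables to the speculative-decoding
# flags shown in the serving examples.
spec_flags() {
  spec_type="${MIMO_SPEC_TYPE:-draft-mtp}"      # documented default
  draft_n_max="${MIMO_SPEC_DRAFT_N_MAX:-3}"     # documented default

  # "none" means: emit no speculative flags at all.
  if [ "$spec_type" = "none" ]; then
    return 0
  fi

  printf '%s\n' "--spec-type $spec_type --spec-draft-n-max $draft_n_max"
}

# Print the flags that would be appended to the llama-server command line.
spec_flags
```

With both variables unset, this prints the same two flags as the serving examples. With `MIMO_SPEC_TYPE=none` it prints nothing, so the server would start as a plain non-speculative model and may log the unused-tensor warning described above.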
188
  ## Tool-Calling Notes
189
 
190
  For best tool-calling results:
191
 
 
192
  - Disable model reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
193
  - Set `parallel_tool_calls` to `false` if your client supports it.
194
  - Avoid forcing `tool_choice: required`; in testing it made malformed calls more likely.
195
+ - Use request-provided OpenAI tool schemas rather than llama.cpp built-in server tools unless you are intentionally testing those built-ins.
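
As a minimal sketch of the notes above, a client request could carry its own tool schema, leave `tool_choice` on `auto`, and set `parallel_tool_calls` to `false`. The `list_files` tool, the `model` value, and the `SEND_REQUEST` switch are placeholders invented for this example. The request is only sent when `SEND_REQUEST=1`, and it assumes the server from the serving section is listening on `127.0.0.1:8080`.

```sh
#!/bin/sh
# Illustrative request body; the tool and the model name are placeholders.
build_tool_request() {
  cat <<'EOF'
{
  "model": "local",
  "messages": [
    {"role": "user", "content": "List the files in the current directory."}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_files",
        "description": "List files in a directory",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "Directory to list"}
          },
          "required": ["path"]
        }
      }
    }
  ],
  "tool_choice": "auto",
  "parallel_tool_calls": false
}
EOF
}

# Send the request only when explicitly asked to (hypothetical switch).
if [ "${SEND_REQUEST:-0}" = "1" ]; then
  build_tool_request | curl -sS http://127.0.0.1:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d @-
fi
```

Keeping the payload in a function makes it easy to inspect or diff the JSON before pointing it at a running server.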
196
 
197
  ## License
198