---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
library_name: exllamav3
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
  - exl3
---

Quantization was performed using [exllamav3 v0.0.29](https://github.com/turboderp-org/exllamav3) (commit `cb1a436`).

The original model is distributed in **FP8** (`float8_e4m3fn`) rather than FP16/BF16, which is why the 8.0bpw quant is nearly identical in size to the original.

PPL and KL-divergence metrics could not be computed for this model due to inf/NaN values produced by layer 61 experts during the forward pass. This is not specific to EXL3; the same issue [affects 21-38% of GGUF quantizations](https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/) across multiple providers. Top-K agreement against the original model is provided instead.
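For reference, Top-K agreement measures how often the quantized model ranks the same tokens at the top as the original. A minimal sketch of the metric, assuming the strict definition where all K top-ranked token ids must match in order (the exact implementation in exllamav3's eval scripts may differ):

```python
import numpy as np

def topk_agreement(ref_logits: np.ndarray, q_logits: np.ndarray, k: int) -> float:
    """Fraction of positions where the quantized model's ordered
    top-k token ids exactly match the reference model's."""
    ref_top = np.argsort(-ref_logits, axis=-1)[:, :k]
    q_top = np.argsort(-q_logits, axis=-1)[:, :k]
    return float(np.mean(np.all(ref_top == q_top, axis=-1)))

# Toy example: two positions over a 3-token vocabulary.
ref = np.array([[0.1, 0.9, 0.0],
                [0.5, 0.2, 0.3]])
q   = np.array([[0.2, 0.7, 0.1],   # same top-1 token (id 1)
                [0.4, 0.5, 0.1]])  # different top-1 token (id 1 vs id 0)
print(topk_agreement(ref, q, 1))   # -> 0.5
```

This strict-match definition also explains why the table's percentages fall as K grows: every additional rank must agree simultaneously.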

| Quant | Size (GB) | Actual bpw | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|
| [2.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/2.0bpw) | 55.14 | 2.00 | 76.0% | 41.8% | 18.5% | 7.1% | 2.5% |
| [3.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/3.0bpw) | 81.61 | 3.00 | 85.6% | 59.3% | 35.1% | 18.5% | 8.9% |
| [4.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/4.0bpw) | 108.09 | 4.00 | 90.3% | 70.5% | 49.0% | 31.2% | 18.5% |
| [5.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/5.0bpw) | 134.56 | 5.00 | 92.9% | 77.5% | 59.1% | 41.7% | 27.7% |
| [6.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/6.0bpw) | 161.18 | 6.00 | 94.4% | 81.5% | 65.2% | 49.1% | 35.0% |
| [7.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/7.0bpw) | 187.65 | 7.00 | 94.9% | 83.2% | 68.0% | 52.5% | 38.6% |
| [8.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/8.0bpw) | 214.13 | 8.00 | 95.2% | 84.0% | 69.5% | 54.4% | 40.7% |
| original | 214.36 | 8.00 | β€” | β€” | β€” | β€” | β€” |
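Each quant lives in its own branch of the repo, so pass the branch name as the revision when downloading. A typical invocation (the local directory name below is just an example):

```shell
huggingface-cli download NeuroSenko/MiniMax-M2.7-exl3 \
  --revision 4.0bpw \
  --local-dir ./MiniMax-M2.7-exl3-4.0bpw
```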

<details>
<summary>Quantization Notes</summary>

### Inf/NaN values during calibration

Some experts in the model produce `inf` values during calibration (e.g. experts 61 and 74 in the last layer had inf values in their down-projection calibration state). The `lm_head` layer also exhibited NaN values in its calibration state (445K NaN out of 1.5B elements).
                                                                                                                                                                                              
This causes Cholesky decomposition to fail during quantization, as the Hessian matrix is no longer positive-definite. ExLlamaV3 does not handle this case gracefully: quantization crashes after exhausting its retry attempts. A local patch was applied to fall back to uncalibrated quantization for the affected tensors. Since only a handful of the 256 experts in the last layer are affected, the impact on output quality is expected to be minimal.
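The failure mode is easy to reproduce in isolation: once the calibration state is corrupted, the Hessian stops being positive-definite and the factorization is rejected. A minimal sketch with NumPy, using a synthetic 2x2 matrix rather than the actual calibration state:

```python
import numpy as np

# A symmetric positive-definite "Hessian" factors cleanly.
h = np.array([[4.0, 2.0],
              [2.0, 3.0]])
l = np.linalg.cholesky(h)
assert np.allclose(l @ l.T, h)

# An inf/NaN-corrupted calibration state yields a matrix that is no
# longer positive-definite; mimic that here with a symmetric matrix
# whose determinant is negative, which cholesky() rejects the same way.
h_bad = np.array([[4.0, 3.0],
                  [3.0, 2.0]])   # det = 8 - 9 = -1 < 0
try:
    np.linalg.cholesky(h_bad)
    rejected = False
except np.linalg.LinAlgError:
    rejected = True
print("factorization rejected:", rejected)
```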

Note that inf/NaN values are present in the **original model** during inference as well: both the quantized and original models produce NaN perplexity. This appears to be caused by numerically unstable expert weights that overflow during the forward pass, not by the quantizer itself. The same layer (`blk.61.ffn_down_exps`) [has been identified](https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/) as causing NaN perplexity across GGUF quantizations from multiple providers.
</details>

### Set up tool calling for TabbyAPI

Add `tool_format: minimax_m2` to your `config.yml` or per-model `tabby_config.yml`. Also enable `reasoning: true` to properly separate thinking blocks from the output:

```yaml
model:
  tool_format: minimax_m2
  reasoning: true
```
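TabbyAPI serves an OpenAI-compatible API, so once `tool_format` is set, tool calling works through the standard `tools` field of a chat-completion request. A sketch of such a request body; the model name, function name, and schema below are illustrative examples, not part of this repo:

```python
import json

# Illustrative chat-completion payload with one (hypothetical) tool.
payload = {
    "model": "MiniMax-M2.7-exl3",   # whatever name TabbyAPI serves the model under
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # example tool, not provided by the model
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# POST this JSON to the server's /v1/chat/completions endpoint.
body = json.dumps(payload)
```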