---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
library_name: exllamav3
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
  - exl3
---

Quantization was performed using [exllamav3 v0.0.29](https://github.com/turboderp-org/exllamav3) (commit `cb1a436`).

The original model is distributed in **FP8** (`float8_e4m3fn`), not FP16/BF16, which is why the 8.0bpw quant is nearly identical in size to the original.

PPL and KL divergence metrics cannot be computed for this model because experts in layer 61 produce inf/NaN values during the forward pass. This is not specific to EXL3: the same issue [affects 21-38% of GGUF quantizations](https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/) across multiple providers. Top-K agreement against the original is provided instead.

| Quant | Size (GB) | Actual bpw | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|
| [2.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/2.0bpw) | 55.14 | 2.00 | 76.0% | 41.8% | 18.5% | 7.1% | 2.5% |
| [3.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/3.0bpw) | 81.61 | 3.00 | 85.6% | 59.3% | 35.1% | 18.5% | 8.9% |
| [4.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/4.0bpw) | 108.09 | 4.00 | 90.3% | 70.5% | 49.0% | 31.2% | 18.5% |
| [5.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/5.0bpw) | 134.56 | 5.00 | 92.9% | 77.5% | 59.1% | 41.7% | 27.7% |
| [6.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/6.0bpw) | 161.18 | 6.00 | 94.4% | 81.5% | 65.2% | 49.1% | 35.0% |
| [7.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/7.0bpw) | 187.65 | 7.00 | 94.9% | 83.2% | 68.0% | 52.5% | 38.6% |
| [8.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/8.0bpw) | 214.13 | 8.00 | 95.2% | 84.0% | 69.5% | 54.4% | 40.7% |
| original | 214.36 | 8.00 | — | — | — | — | — |
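As a rough illustration of what the table above measures, the sketch below computes top-K agreement between two sets of logits, under the assumption that agreement at rank K means the full top-K token lists of both models match exactly (the function and toy data are illustrative, not exllamav3's actual evaluation code):

```python
import numpy as np

def topk_agreement(logits_a: np.ndarray, logits_b: np.ndarray, k: int) -> float:
    """Fraction of positions where the top-k token rankings of two
    models agree exactly (illustrative definition, not exllamav3's)."""
    # Sort token indices by descending logit, keep the top k per position
    top_a = np.argsort(-logits_a, axis=-1)[:, :k]
    top_b = np.argsort(-logits_b, axis=-1)[:, :k]
    # A position counts as agreeing only if all k ranks match
    return float(np.mean(np.all(top_a == top_b, axis=-1)))

# Toy data: a "reference" model and a slightly perturbed "quantized" one
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 32))
noisy = base + rng.normal(scale=0.05, size=base.shape)
print(topk_agreement(base, noisy, 1))  # near 1.0 for small perturbations
```

This also explains why the Top-5 column is so much lower than Top-1: requiring five ranks to match simultaneously is a far stricter test than requiring only the argmax to match.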
## Quantization Notes

### Inf/NaN values during calibration

Some experts in the model produce `inf` values during calibration (e.g. experts 61 and 74 in the last layer had inf values in their down-projection calibration state). The `lm_head` layer also exhibited NaN values in its calibration state (445K NaN out of 1.5B elements). This causes Cholesky decomposition to fail during quantization, as the Hessian matrix is no longer positive-definite.

ExLlamaV3 does not handle this case gracefully: quantization crashes after exhausting its retry attempts. A local patch was applied to fall back to uncalibrated quantization for the affected tensors. Given that only a handful of the 256 experts in the last layer are affected, the impact on output quality is expected to be minimal.

Note that inf/NaN values are present in the **original model** during inference as well: both the quantized and original models produce NaN perplexity. This appears to be caused by numerically unstable expert weights that overflow during the forward pass, not by the quantizer itself. The same layer (`blk.61.ffn_down_exps`) [has been identified](https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/) as causing NaN perplexity across GGUF quantizations by multiple providers.
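The failure and fallback described above can be sketched as follows. This is an illustrative outline, not the actual patch applied to exllamav3: a Hessian contaminated with inf/NaN is replaced by the identity (equivalent to uncalibrated quantization), and a merely non-positive-definite one gets diagonal damping before retrying:

```python
import numpy as np

def safe_cholesky(hessian: np.ndarray, damping: float = 1e-2) -> np.ndarray:
    """Cholesky factorization with fallbacks for broken calibration
    state (illustrative sketch, not exllamav3's real retry logic)."""
    n = hessian.shape[0]
    if not np.all(np.isfinite(hessian)):
        # inf/NaN in the calibration state: fall back to the identity,
        # which amounts to quantizing the tensor without calibration
        return np.eye(n)
    try:
        return np.linalg.cholesky(hessian)
    except np.linalg.LinAlgError:
        # Not positive-definite: damp the diagonal and retry once
        return np.linalg.cholesky(hessian + damping * np.eye(n))

good = np.array([[4.0, 2.0], [2.0, 3.0]])    # positive-definite
bad = np.array([[np.inf, 0.0], [0.0, 1.0]])  # contaminated, like layer 61
print(safe_cholesky(good))
print(safe_cholesky(bad))  # identity fallback
```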
### Setup tool calling for TabbyAPI

Add `tool_format: minimax_m2` to your `config.yml` or per-model `tabby_config.yml`. Also enable `reasoning: true` to properly separate thinking blocks from output:

```yaml
model:
  tool_format: minimax_m2
  reasoning: true
```
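With that configured, tools are passed through TabbyAPI's OpenAI-compatible chat completions endpoint in the usual OpenAI request shape. Below is a minimal sketch of such a request payload; the model name and the `get_weather` tool are placeholders for your own setup:

```python
import json

# Illustrative payload for POST /v1/chat/completions on a local
# TabbyAPI instance; model name and tool are placeholders
payload = {
    "model": "MiniMax-M2.7-exl3",
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
print(json.dumps(payload, indent=2))
```

With `tool_format: minimax_m2` set, the server handles translating between this OpenAI-style tool schema and the model's native tool-calling format.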