---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
library_name: exllamav3
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- exl3
---
Quantization was performed using exllamav3 v0.0.29 (commit cb1a436).
The original model is distributed in FP8 (float8_e4m3fn), not FP16/BF16, which is why the 8.0bpw quant is nearly identical in size to the original.

PPL and KL-divergence metrics cannot be computed for this model: experts in layer 61 produce inf/NaN values during the forward pass. This is not specific to EXL3; the same issue affects 21-38% of GGUF quantizations across multiple providers. Top-K agreement against the original model is provided instead.
| Quant | Size (GB) | Actual bpw | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|
| 2.0bpw | 55.14 | 2.00 | 76.0% | 41.8% | 18.5% | 7.1% | 2.5% |
| 3.0bpw | 81.61 | 3.00 | 85.6% | 59.3% | 35.1% | 18.5% | 8.9% |
| 4.0bpw | 108.09 | 4.00 | 90.3% | 70.5% | 49.0% | 31.2% | 18.5% |
| 5.0bpw | 134.56 | 5.00 | 92.9% | 77.5% | 59.1% | 41.7% | 27.7% |
| 6.0bpw | 161.18 | 6.00 | 94.4% | 81.5% | 65.2% | 49.1% | 35.0% |
| 7.0bpw | 187.65 | 7.00 | 94.9% | 83.2% | 68.0% | 52.5% | 38.6% |
| 8.0bpw | 214.13 | 8.00 | 95.2% | 84.0% | 69.5% | 54.4% | 40.7% |
| original | 214.36 | 8.00 | – | – | – | – | – |
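The Top-K agreement figures above can be reproduced by comparing the highest-scoring token ids of the quantized and original models at each position. A minimal sketch, assuming "Top-K agreement" means the top-K token sets are identical (the function and variable names are illustrative, not part of exllamav3):

```python
import numpy as np

def topk_agreement(logits_ref: np.ndarray, logits_quant: np.ndarray, k: int) -> float:
    """Fraction of positions where the top-k token sets of the reference
    and quantized logits are identical (order-insensitive)."""
    # Argsort descending, keep the k highest-scoring token ids per position
    top_ref = np.argsort(-logits_ref, axis=-1)[..., :k]
    top_q = np.argsort(-logits_quant, axis=-1)[..., :k]
    matches = [
        set(r) == set(q)
        for r, q in zip(top_ref.reshape(-1, k), top_q.reshape(-1, k))
    ]
    return float(np.mean(matches))
```

As the table shows, agreement drops sharply with K: requiring all of the top 5 tokens to match is a much stricter test than matching only the argmax.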
## Quantization Notes

### Inf/NaN values during calibration
Some experts in the model produce inf values during calibration (e.g. experts 61 and 74 in the last layer had inf values in their down-projection calibration state). The lm_head layer also exhibited NaN values in its calibration state (445K NaN out of 1.5B elements).
This causes Cholesky decomposition to fail during quantization, as the Hessian matrix is no longer positive-definite. ExLlamaV3 does not handle this case gracefully; quantization crashes after exhausting its retry attempts. A local patch was applied to fall back to uncalibrated quantization for the affected tensors. Given that only a handful of experts out of 256 in the last layer are affected, the impact on output quality is expected to be minimal.

Note that inf/NaN values are present in the original model during inference as well: both the quantized and original models produce NaN perplexity. This appears to be caused by numerically unstable expert weights that overflow during the forward pass, not by the quantizer itself. The same layer (blk.61.ffn_down_exps) has been identified as causing NaN perplexity across GGUF quantizations by multiple providers.
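The fallback patch amounts to catching the failed decomposition and substituting an identity Hessian, which is equivalent to uncalibrated quantization. A rough sketch of the idea in NumPy (not the actual patch, whose details differ inside exllamav3):

```python
import numpy as np

def safe_cholesky(hessian: np.ndarray) -> tuple[np.ndarray, bool]:
    """Try to factor the calibration Hessian; fall back to the identity
    (uncalibrated quantization) if it is not positive-definite.
    Returns (factor, calibrated_flag)."""
    # Non-finite entries guarantee failure, so short-circuit first
    if not np.isfinite(hessian).all():
        return np.eye(hessian.shape[0]), False
    try:
        return np.linalg.cholesky(hessian), True
    except np.linalg.LinAlgError:
        # Not positive-definite: quantize this tensor without calibration
        return np.eye(hessian.shape[0]), False
```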
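To verify which tensors are affected before quantizing, the checkpoint can be scanned for non-finite values. A minimal sketch operating on a name-to-array mapping (how the tensors are loaded, e.g. via safetensors, is left out):

```python
import numpy as np

def find_nonfinite_tensors(tensors: dict[str, np.ndarray]) -> dict[str, int]:
    """Map tensor name -> count of inf/NaN elements, for tensors that have any."""
    report = {}
    for name, t in tensors.items():
        bad = int((~np.isfinite(t)).sum())
        if bad:
            report[name] = bad
    return report
```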
## Setting up tool calling for TabbyAPI
Add `tool_format: minimax_m2` to your `config.yml` or per-model `tabby_config.yml`. Also enable `reasoning: true` to properly separate thinking blocks from the output:
```yaml
model:
  tool_format: minimax_m2
  reasoning: true
```
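With this config, TabbyAPI's OpenAI-compatible chat completions endpoint accepts standard tool definitions. A sketch of a request body (the model name and the `get_weather` tool are placeholders, not part of this repository):

```python
import json

# Hypothetical model name and tool definition; adjust to your deployment
payload = {
    "model": "MiniMax-M2.7-exl3",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
# POST this JSON to your TabbyAPI server's /v1/chat/completions endpoint
body = json.dumps(payload)
```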