NeuroSenko committed
Commit 4168bf0 · verified · 1 Parent(s): a6e91ea

Update README.md

Files changed (1)
  1. README.md +8 -6
README.md CHANGED
@@ -10,7 +10,11 @@ tags:
 - exl3
 ---
 
-Quantization was performed using [exllama3 v0.0.29](https://github.com/turboderp-org/exllamav3) (commit `cb1a436`).
+Quantization was performed using [exllamav3 v0.0.29](https://github.com/turboderp-org/exllamav3) (commit `cb1a436`).
+
+The original model is distributed in **FP8** (`float8_e4m3fn`), not FP16/BF16 — this is why the 8.0bpw quant is nearly identical in size to the original.
+
+PPL and KL divergence metrics are non-computable for this model due to inf/NaN values originating from layer 61 expert weights (see Quantization Notes below). This is not specific to EXL3 — the same issue [affects 21-38% of GGUF quantizations](https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/) across multiple providers. Top-K agreement against the original is provided instead.
 
 | Quant | Size (GB) | Actual bpw | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
 |---|---|---|---|---|---|---|---|
@@ -23,16 +27,14 @@ Quantization was performed using [exllama3 v0.0.29](https://github.com/turboderp
 | [8.0bpw](https://huggingface.co/NeuroSenko/MiniMax-M2.7-exl3/tree/8.0bpw) | 214.13 | 8.00 | 95.2% | 84.0% | 69.5% | 54.4% | 40.7% |
 | original | 214.36 | 8.00 | — | — | — | — | — |
 
-\* Original model produces inf/NaN in layer 61, making PPL and KL divergence non-computable.
-
 <details>
-<summary>Quantization Notes</summary>
+<summary>Quantization Notes — inf/NaN in original model weights</summary>
 
 ### Inf/NaN values in calibration data
 
 Some experts in the model produce `inf` values during calibration (e.g. experts 61 and 74 in the last layer had inf values in their down-projection calibration state). The `lm_head` layer also exhibited NaN values in its calibration state (445K NaN out of 1.5B elements).
 
 This causes Cholesky decomposition to fail during quantization, as the Hessian matrix is no longer positive-definite. ExLlamaV3 does not handle this case gracefully — quantization crashes after exhausting retry attempts. A local patch was applied to fall back to uncalibrated quantization for the affected tensors. Given that only a handful of experts out of 256 in the last layer are affected, the impact on output quality is expected to be minimal.
-
-This appears to be a property of the model weights themselves, not a bug in the quantizer.
+
+Note that inf/NaN values are present in the **original model** during inference as well — both the quantized and original models produce NaN perplexity. This appears to be caused by numerically unstable expert weights that produce overflow during forward pass, not by the quantizer itself. The same layer (`blk.61.ffn_down_exps`) [has been identified](https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/) as causing NaN perplexity across GGUF quantizations by multiple providers.
 </details>
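The FP8 claim above is straightforward to verify without loading any weights: each safetensors shard stores every tensor's dtype in a small JSON header at the start of the file. A minimal sketch, assuming only the standard safetensors file layout (the shard filename is a placeholder):

```python
# Sketch: count tensor dtypes in a safetensors shard by reading only the JSON
# header at the start of the file (8-byte little-endian length, then the header),
# so none of the ~214 GB of weights has to be loaded. The filename is a placeholder.
import json
import struct
from collections import Counter

def shard_dtypes(path: str) -> Counter:
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional non-tensor entry in the header
    return Counter(v["dtype"] for k, v in header.items() if k != "__metadata__")

print(shard_dtypes("model-00001-of-000XX.safetensors"))
# An FP8 checkpoint reports a float8 dtype (e.g. "F8_E4M3") for the weight
# tensors, i.e. 8 bits per weight, which is why the 8.0bpw quant ends up
# almost exactly the same size as the original.
```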
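Because PPL and KL divergence are unavailable, the table reports top-K agreement instead. The README does not spell out the exact definition, so the sketch below implements one plausible reading, where a position counts as agreeing if both models produce the identical ordered top-K token list; the eval script behind the table may differ in detail:

```python
# Sketch of a top-K agreement metric between two models' logits at the same
# token positions. This implements one plausible reading of the table above
# (the ordered top-K token lists must match exactly); the eval script that
# produced those numbers may define the metric slightly differently.
import torch

def topk_agreement(logits_ref: torch.Tensor, logits_quant: torch.Tensor, k: int) -> float:
    """logits_*: [positions, vocab]; fraction of positions where both models
    rank the same k tokens in the same order."""
    top_ref = logits_ref.topk(k, dim=-1).indices
    top_quant = logits_quant.topk(k, dim=-1).indices
    return (top_ref == top_quant).all(dim=-1).float().mean().item()

# Toy usage; in practice the logits come from the original and quantized models
# evaluated on the same text.
ref = torch.randn(512, 32_000)
quant = ref + 0.05 * torch.randn_like(ref)
for k in range(1, 6):
    print(f"Top-{k}: {topk_agreement(ref, quant, k):.1%}")
```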
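The calibration problem described in the notes amounts to non-finite values appearing in per-tensor statistics. A minimal audit of captured calibration states (or raw weight tensors) looks like this; the tensor names and shapes are illustrative toys, not the real MiniMax-M2 tensors:

```python
# Sketch: audit a dict of tensors (captured calibration states, or weights) for
# non-finite values. Names and shapes below are illustrative toys; in practice
# the dict would be filled by forward hooks during the calibration pass.
import torch

def report_nonfinite(states: dict[str, torch.Tensor]) -> None:
    for name, t in states.items():
        n_inf = torch.isinf(t).sum().item()
        n_nan = torch.isnan(t).sum().item()
        if n_inf or n_nan:
            print(f"{name}: {n_inf} inf, {n_nan} NaN out of {t.numel()} elements")

states = {
    "blocks.61.experts.61.down_proj": torch.randn(256, 512),
    "lm_head": torch.randn(512, 1024),
}
states["blocks.61.experts.61.down_proj"][0, 0] = float("inf")  # injected for the demo
report_nonfinite(states)
```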
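The local patch itself is only described in prose. Its intent can be sketched as follows, assuming a GPTQ-style flow where each tensor is quantized against a Hessian built from calibration activations: if the Hessian is non-finite, or still fails Cholesky factorization after damping, fall back to an identity Hessian, which is equivalent to quantizing that tensor without calibration. This is an illustration of the idea, not the actual exllamav3 code:

```python
# Sketch of the fallback described above (not the actual patch): if a tensor's
# Hessian contains non-finite values, or still fails Cholesky after damping,
# substitute an identity Hessian, i.e. quantize that tensor uncalibrated.
import torch

def usable_hessian(h: torch.Tensor, damping: float = 0.01) -> torch.Tensor:
    n = h.shape[0]
    eye = torch.eye(n, dtype=h.dtype, device=h.device)
    if not torch.isfinite(h).all():
        return eye                                    # uncalibrated fallback
    h = h + damping * h.diagonal().mean() * eye       # standard diagonal damping
    _, info = torch.linalg.cholesky_ex(h)             # info != 0: not positive-definite
    return h if info.item() == 0 else eye

broken = torch.full((8, 8), float("inf"))   # stand-in for a broken calibration state
print(usable_hessian(broken))               # identity matrix: quantized without calibration
```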
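Finally, the claim that the original model already overflows during inference can be checked by instrumenting a forward pass and recording the first module whose output goes non-finite. A generic sketch using PyTorch forward hooks (the toy model stands in for the real checkpoint):

```python
# Sketch: find the first module whose output goes non-finite during a forward
# pass, using PyTorch forward hooks. The toy model at the bottom stands in for
# the real checkpoint; the same instrumentation works on any nn.Module tree.
from typing import Optional

import torch
import torch.nn as nn

def find_first_nonfinite(model: nn.Module, *inputs) -> Optional[str]:
    offenders = []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                offenders.append(name)
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for h in handles:
            h.remove()
    return offenders[0] if offenders else None

# Toy demonstration: the second layer overflows, so it is reported first.
toy = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
with torch.no_grad():
    toy[1].weight.fill_(float("inf"))
print(find_first_nonfinite(toy, torch.randn(1, 16)))   # -> "1"
```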