Re-quantization in progress
README.md (changed)
@@ -10,45 +10,27 @@ tags:

**Removed:**

# MiniMax-M2.7 APEX GGUF
| Source | GGUF | CPU output | GPU output |
|---|---|---|---|
| **Ours** (APEX-I-Mini, Q3_K + imatrix) | 81 GB | garbled CJK/English fragments | `&&&&&&&&` / empty |
| **unsloth/MiniMax-M2.7-GGUF** (UD-Q4_K_M) | ~130 GB | same garbled output | same garbled output |
| **bartowski/MiniMaxAI_MiniMax-M2.7-GGUF** (Q4_K_M) | ~130 GB | same garbled output | same garbled output |

All three independently produced GGUFs, built with different converters and different quantization recipes from different source formats (FP8 safetensors → BF16/F16 → k-quants), exhibit the **same** failure mode. The model loads, the tokenizer parses correctly, and the chat template is applied, but the logits that come out of the forward pass are unusable: either repeated control tokens like `&&&&` on CUDA or a mix of unrelated BPE fragments on CPU.
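For illustration only (this helper is ours, not part of llama.cpp or the original debugging), degenerate output of this kind can be flagged automatically with a simple character-diversity heuristic:

```python
def looks_degenerate(text: str, max_unique_ratio: float = 0.2) -> bool:
    """Flag output dominated by a handful of repeated characters (or empty)."""
    stripped = text.strip()
    if not stripped:
        return True  # empty output is also a failure mode
    # Healthy text uses a wide spread of characters; "&&&&&&&&" uses one.
    return len(set(stripped)) / len(stripped) <= max_unique_ratio

print(looks_degenerate("&&&&&&&&"))                         # True
print(looks_degenerate("The forward pass looks healthy."))  # False
```

A crude check like this is enough to gate automated smoke tests of a quant before publishing it.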

## Root cause (as far as we can tell)

The `minimax-m2` architecture implementation in llama.cpp is incomplete for MiniMax-M2.7 specifically. Possible contributors:

- `MiniMaxM2Model.set_vocab()` is not overridden in `convert_hf_to_gguf.py`: it inherits the default `TextModel` path, which does not mark the 54 custom special tokens (e.g. `[e~[`, `]~b]system`, `]~b]ai`, `]~!b[`) used by the MiniMax chat template as `USER_DEFINED`. The chat-template parser relies on these being rendered verbatim.
- The FP8 `float8_e4m3fn` block-quantization (128×128) dequantization path used for the source weights may not produce exactly the tensors that the runtime expects for the MoE expert routing.
- The full `minimax-m2` graph (MoE routing + attention) may simply not match the reference implementation yet.
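A hedged sketch of the first point: an overridden `set_vocab` would need to tag the chat-template markers as `USER_DEFINED` in the token-type array written into the GGUF metadata. The snippet below is illustrative only — it mimics the GGUF token-type ids with local constants rather than using llama.cpp's actual converter API:

```python
# Token-type ids as used in GGUF metadata (mirrors gguf-py's TokenType enum;
# treat the values as an assumption, not a definitive reference).
TOKEN_TYPE_NORMAL = 1
TOKEN_TYPE_USER_DEFINED = 4

# A few of the 54 MiniMax chat-template markers mentioned above.
MINIMAX_SPECIAL_TOKENS = {"[e~[", "]~b]system", "]~b]ai", "]~!b["}

def build_token_types(vocab: list[str]) -> list[int]:
    """Tag chat-template markers USER_DEFINED so they render verbatim."""
    return [
        TOKEN_TYPE_USER_DEFINED if tok in MINIMAX_SPECIAL_TOKENS
        else TOKEN_TYPE_NORMAL
        for tok in vocab
    ]

vocab = ["hello", "]~b]system", "world", "[e~["]
print(build_token_types(vocab))  # [1, 4, 1, 4]
```

Without this tagging, the default path leaves the markers as ordinary BPE text, so the tokenizer splits them into fragments instead of emitting single special tokens.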

Because the same bug reproduces across bartowski/unsloth/our quants, and because the raw (non-chat) completion endpoint also produces garbage, re-quantizing cannot fix this. A fix has to come from upstream llama.cpp.
## What to do if you want to run MiniMax-M2.7

- **Wait for a fix** in llama.cpp. Track [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/issues?q=minimax) for updates.
- When llama.cpp support matures, we'll re-publish APEX quants here.

## Credits

**Added:**

# MiniMax-M2.7 APEX GGUF

**APEX (Adaptive Precision for EXpert Models)** quantizations of [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).

**Brought to you by the [LocalAI](https://github.com/mudler/LocalAI) team** | [APEX Project](https://github.com/mudler/apex-quant) | [Technical Report](https://github.com/mudler/apex-quant/blob/main/paper/APEX_Technical_Report.pdf)

> **Status: Re-quantization in progress.** The previous quants had a conversion bug: our direct FP8→BF16 path produced broken logits. We've identified the issue, switched to unsloth's pre-converted BF16 GGUF as the source, and are re-quantizing. Working quants will be back shortly.

## About APEX
APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient: edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration.
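As a sketch of the idea only — the real recipe lives in the apex-quant repo, and the edge threshold and quant-type names below are illustrative assumptions, not the published APEX tables:

```python
def apex_quant_for(layer: int, n_layers: int, role: str) -> str:
    """Illustrative APEX-style assignment: attention stays high precision,
    edge layers are protected, middle routed experts compress hardest."""
    edge = layer < 2 or layer >= n_layers - 2  # first/last layers are "edges"
    if role == "attention":
        return "Q6_K"
    if role == "shared_expert":
        return "Q5_K" if edge else "Q4_K"
    if role == "routed_expert":
        return "Q4_K" if edge else "Q3_K"
    raise ValueError(f"unknown tensor role: {role}")

# 62-layer MiniMax-M2.7: middle routed experts take the aggressive quant.
print(apex_quant_for(0, 62, "routed_expert"))   # Q4_K (edge layer)
print(apex_quant_for(31, 62, "routed_expert"))  # Q3_K (middle layer)
```

Because routed experts dominate the parameter count in a 256-expert MoE, pushing only the middle-layer experts down a tier is where most of the size savings come from.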

See the [APEX project](https://github.com/mudler/apex-quant) for full details, the technical report, and scripts.

## Architecture

- **Model**: MiniMax-M2.7 (MiniMaxM2)
- **Layers**: 62
- **Experts**: 256 routed (8 active per token)
- **Total Parameters**: ~228B
- **Active Parameters**: ~10B per token
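
A quick sanity check on those numbers (pure arithmetic, no model required): only 8 of 256 routed experts fire per token, which is why ~228B total parameters shrink to roughly ~10B active.

```python
# Figures from the architecture list above (approximate, hence the "~").
total_params_b = 228      # ~228B total parameters
active_params_b = 10      # ~10B active per token
n_experts, active_experts = 256, 8

expert_activation = active_experts / n_experts      # routed experts used per token
active_fraction = active_params_b / total_params_b  # weights touched per token

print(f"{expert_activation:.3%}")  # 3.125%
print(f"{active_fraction:.3%}")    # ~4.4%
```

The active fraction comes out a bit higher than the 8/256 expert ratio, which is consistent with attention and other non-expert weights running for every token.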
## Credits