Commit aaee156 by mudler (verified) · Parent: 30ec379

Re-quantization in progress

Files changed (1): README.md (+13 −31)
README.md CHANGED
@@ -10,45 +10,27 @@ tags:
  - minimax
  ---

- # MiniMax-M2.7 APEX GGUF: ⚠️ Quants Removed (Upstream Bug)
-
- ## Status: Quants removed due to an upstream llama.cpp bug
-
- **The APEX quants for MiniMax-M2.7 have been removed from this repository.** During testing we discovered that the `minimax-m2` architecture in current llama.cpp (as of b8779, April 2026) produces garbled, degenerate output at inference time, regardless of how the model is quantized.
-
- ## What we tested
-
- We verified the problem is **not specific to our APEX quants** by testing multiple independent sources:
-
- | Source | GGUF size | CPU output | GPU output |
- |---|---|---|---|
- | **Ours** (APEX-I-Mini, Q3_K + imatrix) | 81 GB | garbled CJK/English fragments | `&&&&&&&&` / empty |
- | **unsloth/MiniMax-M2.7-GGUF** (UD-Q4_K_M) | ~130 GB | same garbled output | same garbled output |
- | **bartowski/MiniMaxAI_MiniMax-M2.7-GGUF** (Q4_K_M) | ~130 GB | same garbled output | same garbled output |
-
- All three independently produced GGUFs, built with different converters, different quantization recipes, and different source formats (FP8 safetensors → BF16/F16 → k-quants), exhibit the **same** failure mode. The model loads, the tokenizer parses correctly, and the chat template is applied, but the logits from the forward pass are unusable: repeated control tokens like `&&&&` on CUDA, or a mix of unrelated BPE fragments on CPU.
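For reference, the failure is easy to check from the command line; a minimal sketch, assuming a local llama.cpp build (the GGUF filename is a placeholder for any of the files in the table):

```shell
# Placeholder model path: substitute any MiniMax-M2.7 GGUF from the table.
# Expected: coherent text. On affected llama.cpp builds the output is
# instead '&&&&...' on CUDA or unrelated BPE fragments on CPU.
./llama-cli -m ./MiniMax-M2.7-Q4_K_M.gguf \
  -p "Write one sentence about the sea." \
  -n 64 --temp 0
```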
-
- ## Root cause (as far as we can tell)
-
- The `minimax-m2` architecture implementation in llama.cpp is incomplete for MiniMax-M2.7 specifically. Possible contributors:
-
- - `MiniMaxM2Model.set_vocab()` is not overridden in `convert_hf_to_gguf.py`; it inherits the default `TextModel` path, which does not mark the 54 custom special tokens (e.g. `[e~[`, `]~b]system`, `]~b]ai`, `]~!b[`) used by the MiniMax chat template as `USER_DEFINED`. The chat-template parser relies on these tokens being rendered verbatim.
- - The dequantization path for the FP8 `float8_e4m3fn` block-quantized (128×128 blocks) source weights may not produce exactly the tensors the runtime expects for MoE expert routing.
- - The full `minimax-m2` graph (MoE routing + attention) may simply not match the reference implementation yet.
-
- Because the same bug reproduces across bartowski's, unsloth's, and our quants, and the raw (non-chat) completion endpoint also produces garbage, re-quantizing cannot fix it. A fix has to come from upstream llama.cpp.
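The first bullet above can be illustrated without llama.cpp itself. The sketch below is hypothetical, not the actual `convert_hf_to_gguf.py` code: it only mirrors the classification a `set_vocab()` override would need, so that the chat-template sentinels are matched verbatim by the tokenizer instead of being split by BPE:

```python
from enum import IntEnum

class TokenType(IntEnum):
    """Subset of the token-type values used in GGUF vocab metadata."""
    NORMAL = 1
    CONTROL = 3
    USER_DEFINED = 4

# A few of the 54 MiniMax chat-template sentinels mentioned above.
MINIMAX_SPECIAL_TOKENS = {"[e~[", "]~b]system", "]~b]ai", "]~!b["}

def classify_token(token: str) -> TokenType:
    """Hypothetical helper mirroring what a set_vocab() override must do:
    tag the chat-template sentinels USER_DEFINED so the runtime matches
    them verbatim instead of letting BPE split them into fragments."""
    if token in MINIMAX_SPECIAL_TOKENS:
        return TokenType.USER_DEFINED
    return TokenType.NORMAL

print(classify_token("]~b]ai").name)   # USER_DEFINED
print(classify_token("hello").name)    # NORMAL
```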
40
-
41
- ## What to do if you want to run MiniMax-M2.7
42
 
43
- - **Run the native weights** with vLLM, SGLang, or transformers directly β€” those backends handle the `minimax-m2` architecture correctly.
44
- - **Wait for a fix** in llama.cpp. Track [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/issues?q=minimax) for updates.
45
- - When llama.cpp support matures, we'll re-publish APEX quants here.
-
- ## About APEX
-
- APEX (Adaptive Precision for EXpert Models) is a quantization strategy for Mixture-of-Experts models that classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient: edge layers get higher precision, middle layers get more aggressive compression. I-variants additionally use diverse imatrix calibration.
-
- See the [APEX project](https://github.com/mudler/apex-quant) for details, technical report, and scripts. Working APEX quants are available for many other MoE models; see e.g. [mudler/gemma-4-26B-A4B-it-APEX-GGUF](https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF).

  ## Credits

  - minimax
  ---

+ # MiniMax-M2.7 APEX GGUF
+
+ **APEX (Adaptive Precision for EXpert Models)** quantizations of [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).
+
+ **Brought to you by the [LocalAI](https://github.com/mudler/LocalAI) team** | [APEX Project](https://github.com/mudler/apex-quant) | [Technical Report](https://github.com/mudler/apex-quant/blob/main/paper/APEX_Technical_Report.pdf)
+
+ > **Status: Re-quantization in progress.** The previous quants had a conversion bug: our direct FP8→BF16 path produced broken logits. We've identified the issue and are re-quantizing, using unsloth's pre-converted BF16 GGUF as the source instead. Working quants will be back shortly.
+
+ ## About APEX
+
+ APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient: edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration.
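As a toy numerical sketch of that gradient (the 5-bit edge / 3-bit middle values and the sine ramp are illustrative assumptions, not the published APEX recipe):

```python
import math

def layer_bits(layer: int, n_layers: int,
               edge_bits: float = 5.0, mid_bits: float = 3.0) -> float:
    """Illustrative precision schedule: most bits at the first and last
    layers, fewest in the middle (assumed values, not the APEX recipe)."""
    t = layer / (n_layers - 1)            # position in the stack, 0..1
    return edge_bits - (edge_bits - mid_bits) * math.sin(math.pi * t)

n = 62  # MiniMax-M2.7 layer count
print(f"{layer_bits(0, n):.2f} {layer_bits(n // 2, n):.2f} {layer_bits(n - 1, n):.2f}")
# 5.00 3.00 5.00  -- edge layers keep more precision than the middle
```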
+
+ See the [APEX project](https://github.com/mudler/apex-quant) for full details, the technical report, and scripts.
+
+ ## Architecture
+
+ - **Model**: MiniMax-M2.7 (MiniMaxM2)
+ - **Layers**: 62
+ - **Experts**: 256 routed (8 active per token)
+ - **Total Parameters**: ~228B
+ - **Active Parameters**: ~10B per token
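The total-vs-active split follows directly from the expert counts above; a quick sanity check, using the approximate parameter figures from the list:

```python
# Approximate figures from the spec list above.
total_params = 228e9        # ~228B total
active_params = 10e9        # ~10B active per token
routed_experts = 256
active_experts = 8

# Only 8 of 256 routed experts run for each token ...
print(f"{active_experts / routed_experts:.4f}")       # 0.0312
# ... so only ~4% of all weights participate in a given forward pass.
print(f"{active_params / total_params:.3f}")          # 0.044
```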

  ## Credits