Commit aaee156 by mudler (verified) · Parent: 30ec379

Re-quantization in progress

Files changed (1): README.md (+13 −31)
README.md CHANGED
@@ -10,45 +10,27 @@ tags:
  - minimax
  ---

- # MiniMax-M2.7 APEX GGUF: ⚠️ Quants Removed (Upstream Bug)
-
- ## Status: Quants removed due to an upstream llama.cpp bug
-
- **The APEX quants for MiniMax-M2.7 have been removed from this repository.** During testing we discovered that the `minimax-m2` architecture in current llama.cpp (as of b8779, April 2026) produces garbled, degenerate output at inference time, regardless of how the model is quantized.
-
- ## What we tested
-
- We verified the problem is **not specific to our APEX quants** by testing multiple independent sources:
-
- | Source | GGUF size | CPU output | GPU output |
- |---|---|---|---|
- | **Ours** (APEX-I-Mini, Q3_K + imatrix) | 81 GB | garbled CJK/English fragments | `&&&&&&&&` / empty |
- | **unsloth/MiniMax-M2.7-GGUF** (UD-Q4_K_M) | ~130 GB | same garbled output | same garbled output |
- | **bartowski/MiniMaxAI_MiniMax-M2.7-GGUF** (Q4_K_M) | ~130 GB | same garbled output | same garbled output |
-
- All three independently produced GGUFs, built with different converters, different quantization recipes, and different source formats (FP8 safetensors → BF16/F16 → k-quants), exhibit the **same** failure mode. The model loads, the tokenizer parses correctly, and the chat template is applied, but the logits from the forward pass are unusable: repeated control tokens like `&&&&` on CUDA, or a mix of unrelated BPE fragments on CPU.
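For reference, the failure is easy to check from the command line; a minimal sketch, assuming a local llama.cpp build (the GGUF filename is a placeholder for any of the files in the table):

```shell
# Placeholder model path: substitute any MiniMax-M2.7 GGUF from the table.
# Expected: coherent text. On affected llama.cpp builds the output is
# instead '&&&&...' on CUDA or unrelated BPE fragments on CPU.
./llama-cli -m ./MiniMax-M2.7-Q4_K_M.gguf \
  -p "Write one sentence about the sea." \
  -n 64 --temp 0
```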
-
- ## Root cause (as far as we can tell)
-
- The `minimax-m2` architecture implementation in llama.cpp is incomplete for MiniMax-M2.7 specifically. Possible contributors:
-
- - `MiniMaxM2Model.set_vocab()` is not overridden in `convert_hf_to_gguf.py`; it inherits the default `TextModel` path, which does not mark the 54 custom special tokens (e.g. `[e~[`, `]~b]system`, `]~b]ai`, `]~!b[`) used by the MiniMax chat template as `USER_DEFINED`. The chat-template parser relies on these tokens being rendered verbatim.
- - The dequantization path for the FP8 `float8_e4m3fn` block-quantized (128×128 blocks) source weights may not produce exactly the tensors the runtime expects for MoE expert routing.
- - The full `minimax-m2` graph (MoE routing + attention) may simply not match the reference implementation yet.
-
- Because the same bug reproduces across bartowski's, unsloth's, and our quants, and the raw (non-chat) completion endpoint also produces garbage, re-quantizing cannot fix it. A fix has to come from upstream llama.cpp.
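The first bullet above can be illustrated without llama.cpp itself. The sketch below is hypothetical, not the actual `convert_hf_to_gguf.py` code: it only mirrors the classification a `set_vocab()` override would need, so that the chat-template sentinels are matched verbatim by the tokenizer instead of being split by BPE:

```python
from enum import IntEnum

class TokenType(IntEnum):
    """Subset of the token-type values used in GGUF vocab metadata."""
    NORMAL = 1
    CONTROL = 3
    USER_DEFINED = 4

# A few of the 54 MiniMax chat-template sentinels mentioned above.
MINIMAX_SPECIAL_TOKENS = {"[e~[", "]~b]system", "]~b]ai", "]~!b["}

def classify_token(token: str) -> TokenType:
    """Hypothetical helper mirroring what a set_vocab() override must do:
    tag the chat-template sentinels USER_DEFINED so the runtime matches
    them verbatim instead of letting BPE split them into fragments."""
    if token in MINIMAX_SPECIAL_TOKENS:
        return TokenType.USER_DEFINED
    return TokenType.NORMAL

print(classify_token("]~b]ai").name)   # USER_DEFINED
print(classify_token("hello").name)    # NORMAL
```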
40
-
41
- ## What to do if you want to run MiniMax-M2.7
42
 
43
- - **Run the native weights** with vLLM, SGLang, or transformers directly β€” those backends handle the `minimax-m2` architecture correctly.
44
- - **Wait for a fix** in llama.cpp. Track [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/issues?q=minimax) for updates.
45
- - When llama.cpp support matures, we'll re-publish APEX quants here.
-
- ## About APEX
-
- APEX (Adaptive Precision for EXpert Models) is a quantization strategy for Mixture-of-Experts models that classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient: edge layers get higher precision, middle layers get more aggressive compression. I-variants additionally use diverse imatrix calibration.
-
- See the [APEX project](https://github.com/mudler/apex-quant) for details, technical report, and scripts. Working APEX quants are available for many other MoE models; see e.g. [mudler/gemma-4-26B-A4B-it-APEX-GGUF](https://huggingface.co/mudler/gemma-4-26B-A4B-it-APEX-GGUF).

  ## Credits

  - minimax
  ---

+ # MiniMax-M2.7 APEX GGUF
+
+ **APEX (Adaptive Precision for EXpert Models)** quantizations of [MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).
+
+ **Brought to you by the [LocalAI](https://github.com/mudler/LocalAI) team** | [APEX Project](https://github.com/mudler/apex-quant) | [Technical Report](https://github.com/mudler/apex-quant/blob/main/paper/APEX_Technical_Report.pdf)
+
+ > **Status: Re-quantization in progress.** The previous quants had a conversion bug: our direct FP8→BF16 path produced broken logits. We've identified the issue and are re-quantizing, using unsloth's pre-converted BF16 GGUF as the source instead. Working quants will be back shortly.
+
+ ## About APEX
+
+ APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient: edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration.
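As a toy numerical sketch of that gradient (the 5-bit edge / 3-bit middle values and the sine ramp are illustrative assumptions, not the published APEX recipe):

```python
import math

def layer_bits(layer: int, n_layers: int,
               edge_bits: float = 5.0, mid_bits: float = 3.0) -> float:
    """Illustrative precision schedule: most bits at the first and last
    layers, fewest in the middle (assumed values, not the APEX recipe)."""
    t = layer / (n_layers - 1)            # position in the stack, 0..1
    return edge_bits - (edge_bits - mid_bits) * math.sin(math.pi * t)

n = 62  # MiniMax-M2.7 layer count
print(f"{layer_bits(0, n):.2f} {layer_bits(n // 2, n):.2f} {layer_bits(n - 1, n):.2f}")
# 5.00 3.00 5.00  -- edge layers keep more precision than the middle
```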
+
+ See the [APEX project](https://github.com/mudler/apex-quant) for full details, the technical report, and scripts.
+
+ ## Architecture
+
+ - **Model**: MiniMax-M2.7 (MiniMaxM2)
+ - **Layers**: 62
+ - **Experts**: 256 routed (8 active per token)
+ - **Total Parameters**: ~228B
+ - **Active Parameters**: ~10B per token
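The total-vs-active split follows directly from the expert counts above; a quick sanity check, using the approximate parameter figures from the list:

```python
# Approximate figures from the spec list above.
total_params = 228e9        # ~228B total
active_params = 10e9        # ~10B active per token
routed_experts = 256
active_experts = 8

# Only 8 of 256 routed experts run for each token ...
print(f"{active_experts / routed_experts:.4f}")       # 0.0312
# ... so only ~4% of all weights participate in a given forward pass.
print(f"{active_params / total_params:.3f}")          # 0.044
```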

  ## Credits