---
base_model: MiniMaxAI/MiniMax-M2.7
language:
- en
license: other
license_name: minimax-m2.7-non-commercial
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- mlx
- turboquant
- turboquant-plus
- config-i
- moe
- apple-silicon
quantized_by: thetom-ai
inference: false
---

# MiniMax-M2.7 - TurboQuant+ Config-I (MLX)

**93.5% MMLU at 87 GB. 61 tok/s decode. PPL 4.604.**

228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format: works with stock `mlx_lm` and `mlx-swift-lm`. No custom loaders required.

Config-I quantization of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) (228.7B total, ~1.4B active per token). The policy applies aggressive 2-3 bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, keeps boundary layers at 8-bit, and leaves the MoE router in f16. See the [Config-I paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md) for the policy derivation.

## Compression

| | Size |
|---|---|
| FP8 source | 230 GB |
| **Config-I (3.25 bpw)** | **87 GB** |
| **Reduction** | **62%** |

## Quality

**Perplexity:** 4.604 ± 0.042 (wikitext, 50 samples, 2048 seq length, with turbo4v2 KV compression)

**MMLU (200q, single-pass, reasoning ON):**

| Subject | Score |
|---|---|
| Abstract Algebra | 18/20 |
| Anatomy | 19/20 |
| Astronomy | 19/20 |
| College CS | 18/20 |
| College Physics | 19/20 |
| HS Biology | 20/20 |
| HS Chemistry | 17/20 |
| HS Math | 20/20 |
| Logical Fallacies | 19/20 |
| World Religions | 18/20 |
| **TOTAL** | **187/200 (93.5%)** |

Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with `mlx_lm` on Apple M5 Max 128 GB.

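
The MMLU total can be re-derived from the per-subject rows. The sketch below copies the scores from the table and sums them. The binomial standard error it prints is an illustrative addition to give a feel for the sampling noise of a 200-question run; it is not a figure reported for this model.

```python
import math

# Correct answers out of 20 per subject, copied from the MMLU table above.
scores = {
    "Abstract Algebra": 18,
    "Anatomy": 19,
    "Astronomy": 19,
    "College CS": 18,
    "College Physics": 19,
    "HS Biology": 20,
    "HS Chemistry": 17,
    "HS Math": 20,
    "Logical Fallacies": 19,
    "World Religions": 18,
}
QUESTIONS_PER_SUBJECT = 20


def summarize(scores, per_subject=QUESTIONS_PER_SUBJECT):
    """Return (correct, total, accuracy, binomial standard error)."""
    correct = sum(scores.values())
    total = per_subject * len(scores)
    accuracy = correct / total
    # Illustrative only: treats every question as an independent trial.
    std_err = math.sqrt(accuracy * (1 - accuracy) / total)
    return correct, total, accuracy, std_err


correct, total, accuracy, std_err = summarize(scores)
print(f"{correct}/{total} = {accuracy:.1%} (binomial std. error ~{std_err:.1%})")
```

With only 20 questions per subject, a single answer moves a subject score by 5 points, so the per-subject rows are best read as a rough profile and the total as the headline number.
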
**NIAH (Needle in a Haystack):** 12/12 (100%)

| Context | 10% depth | 50% depth | 90% depth |
|---------|-----------|-----------|-----------|
| 1.4K | ✓ | ✓ | ✓ |
| 2.4K | ✓ | ✓ | ✓ |
| 4.4K | ✓ | ✓ | ✓ |
| 8.3K | ✓ | ✓ | ✓ |

## Speed (Apple M5 Max 128 GB)

All benchmarks with turbo4v2 KV compression enabled. Measured with [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning) (`ek/tom-eric-moe-tuning` branch).

### Prefill

The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead for 5-48% faster prompt processing. The largest gains are at 512-1024 token prompts (+26% to +32%) and at 16K (+48%).

| Context | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift turbo4v2 |
|---------|-------------------|------------------|---------------|--------------------------|
| 128 | 199 t/s | 185 t/s | 185 t/s | +8% |
| 256 | 281 t/s | 267 t/s | 267 t/s | +5% |
| 512 | 368 t/s | 293 t/s | 293 t/s | +26% |
| 1024 | 462 t/s | 351 t/s | 351 t/s | +32% |
| 2048 | 510 t/s | 430 t/s | 430 t/s | +19% |
| 4096 | 514 t/s | 468 t/s | 468 t/s | +10% |
| 8192 | 477 t/s | 436 t/s | 436 t/s | +9% |
| 16384 | 396 t/s | 267 t/s | 267 t/s | +48% |

Note: turbo4v2 adds no measurable prefill overhead. The Swift turbo4v2 and Swift vanilla columns are identical.

### Decode

| Context | Bridge + turbo4v2 | Swift + turbo4v2 |
|---------|-------------------|------------------|
| 128 | 59.2 t/s | 61.1 t/s |
| 256 | 58.7 t/s | 60.5 t/s |
| 512 | 56.6 t/s | 58.5 t/s |
| 1024 | 54.7 t/s | 57.4 t/s |
| 2048 | 53.4 t/s | 50.0 t/s |
| 4096 | 50.0 t/s | 52.1 t/s |
| 8192 | 44.4 t/s | 45.4 t/s |
| 16384 | 37.3 t/s | 36.9 t/s |

Decode is comparable between Bridge and Swift: both paths reach ~59-61 tok/s at short context and degrade gracefully to ~37 tok/s at 16K.

## TurboQuant KV Cache Compression

With Config-I, the model weights are only 87 GB, leaving ~36 GB free on a 128 GB Mac.
At that point, **KV cache is the bottleneck, not the model.** A 32K conversation in bf16 eats 7.9 GB of that headroom. turbo4v2 compresses that to 1.5 GB (5.3x, 81% saved), turning the remaining memory into usable context instead of wasted KV overhead. This is where Config-I + turbo4v2 stacking matters most: the smaller the model, the more context you can reclaim.

| Context | bf16 KV | turbo4v2 KV | Saved |
|---------|---------|-------------|-------|
| 8K | 7.9 GB | 1.5 GB | 6.4 GB |
| 16K | 15.8 GB | 3.0 GB | 12.8 GB |
| 32K | 31.6 GB | 6.0 GB | 25.6 GB |
| 64K | 63.2 GB | 11.9 GB | 51.3 GB |
| 128K | 126.4 GB | 23.9 GB | 102.5 GB |

**Max context on 128 GB M5 Max** (87 GB model, ~36 GB free):

- bf16 KV: **149K tokens**
- turbo4v2 KV: **595K tokens** (4x more context)

The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). PPL of 4.604 measured with everything stacked, with no additional quality penalty.

## Config-I Policy (MiniMax M2.7 Adaptation)

| Component | Bits | Layers | Rationale |
|-----------|------|--------|-----------|
| Expert MLP gate/up | **2-bit** | middle 58 | 98%+ of params, MoE-tolerant |
| Expert MLP down | **3-bit** | middle 58 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | **4-bit** | middle 58 | Uniform per layer |
| Boundary (all tensors) | **8-bit** | first 2 + last 2 | Boundary layer protection |
| MoE router | **f16** | all | Routing precision critical |
| Embeddings + lm_head | **8-bit** | - | Protected |

Uniform MLX quantization produces broken output (~25% MMLU, random guessing) on MiniMax at all bit levels because it compresses attention and routing to the same bits as expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the 98% of parameters that tolerate it.

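
The policy table above can be read as a lookup from (layer index, tensor role) to bit width. The following is a minimal sketch of that lookup, not the actual TurboQuant+ implementation. The tensor name fragments (`experts`, `down_proj`, `self_attn`, `router`, `embed_tokens`, `lm_head`) are hypothetical stand-ins for whatever names the checkpoint uses, the 62-layer count is inferred from "middle 58" plus "first 2 + last 2", and letting the f16 router rule win inside boundary layers is an assumption based on the table's "all" entry.

```python
from typing import Optional


def config_i_bits(
    layer_idx: Optional[int],
    tensor_name: str,
    n_layers: int = 62,   # inferred: 58 middle + 2 first + 2 last
    n_boundary: int = 2,
) -> Optional[int]:
    """Bit width for one tensor under the policy table (16 means kept in f16).

    Tensor names are hypothetical; returns None for tensors the table
    does not cover.
    """
    # Router stays f16 in every layer (assumed to override the boundary rule).
    if "router" in tensor_name:
        return 16
    # Embeddings and lm_head live outside the layer stack.
    if tensor_name in ("embed_tokens", "lm_head"):
        return 8
    if layer_idx is None:
        return None
    # First and last n_boundary layers: all remaining tensors at 8-bit.
    if layer_idx < n_boundary or layer_idx >= n_layers - n_boundary:
        return 8
    # Middle layers: expert MLPs get the most aggressive compression.
    if "experts" in tensor_name:
        return 3 if "down_proj" in tensor_name else 2
    if "self_attn" in tensor_name:
        return 4
    return None


for idx, name in [(0, "experts.gate_proj"), (30, "experts.gate_proj"),
                  (30, "experts.down_proj"), (30, "self_attn.q_proj"),
                  (30, "router"), (None, "lm_head")]:
    print(idx, name, config_i_bits(idx, name))
```

The order of the checks carries the policy: protection rules (router, embeddings, boundary layers) are tested before the aggressive expert-MLP rule, so a tensor is only compressed to 2-3 bit once nothing has claimed it for protection.
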
## Compatibility

| Field | Value |
|-------|-------|
| Format | MLX safetensors (standard) |
| Avg bits | 3.249 bpw |
| Runtime | `mlx_lm` (Python), `mlx-swift-lm` (Swift) |
| Platform | Apple Silicon (M-series Pro/Max/Ultra with 96 GB+ recommended) |
| Quantized on | 2026-04-12 |

**No custom loader needed.** This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with `config.json` quantization metadata will work.

## How to Run

### Python (mlx_lm)

```bash
pip install mlx-lm
# Sample at temp 1.0; the default greedy decoding can loop on this model
python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello" --temp 1.0
```

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")
# Recent mlx_lm releases take sampling settings through a sampler object
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))
```

### Swift (mlx-swift-lm) - TurboQuant KV compression

> **Note:** Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work.

For the speed and KV compression results above, use [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning).

**In code:**

```swift
import MLXLLM

let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX"))

// tokenArray: prompt token IDs prepared by the caller (not shown here)
let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))
```

**As an OpenAI-compatible server:**

```bash
git clone https://github.com/ekryski/mlx-swift-lm.git
cd mlx-swift-lm
git checkout ek/tom-eric-moe-tuning
swift build -c release

# Download the model
hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX

# Run server
.build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080

# Test
curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}'
```

> **Important:** MiniMax M2.7 is an always-reasoning model. Use `temperature=1.0`; greedy decoding (temp=0) causes infinite thinking loops.

### Hermes AI Agent

With the MLXServer running on port 8080, add this to `~/.hermes/config.yaml`:

```yaml
model:
  default: local
  provider: custom
  base_url: http://localhost:8080/v1
  context_length: 196608
```

Then just run `hermes`. It will use whatever model is loaded on the server.

## What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ dramatically in compression sensitivity. The key insight: **compression policy matters more than compression math**. What counts is which tensors to compress, which to protect, and how aggressively. Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by [independent third-party implementations](https://github.com/dhawalc/turboQuantDC).

For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.

- [Config-I Paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md)
- [Getting Started Guide](https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md)
- [TurboQuant+ Repository](https://github.com/TheTom/turboquant_plus)

---

Quantized by [@thetom-ai](https://huggingface.co/thetom-ai) | [GitHub](https://github.com/TheTom) | [X](https://x.com/no_stp_on_snek) | [Sponsor](https://github.com/sponsors/TheTom)