---
base_model: MiniMaxAI/MiniMax-M2.7
language:
- en
license: other
license_name: minimax-m2.7-non-commercial
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- mlx
- turboquant
- turboquant-plus
- config-i
- moe
- apple-silicon
quantized_by: thetom-ai
inference: false
---
# MiniMax-M2.7 - TurboQuant+ Config-I (MLX)
**93.5% MMLU at 87 GB. 61 tok/s decode. PPL 4.604.** 228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format: works with stock `mlx_lm` and `mlx-swift-lm`. No custom loaders required.
Config-I quantization of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) (228.7B total, ~1.4B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers and routing at full precision. See the [Config-I paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md) for the policy derivation.
## Compression
| | Size |
|---|---|
| FP8 source | 230 GB |
| **Config-I (3.25 bpw)** | **87 GB** |
| **Reduction** | **62%** |
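
As a rough cross-check, the compressed size follows from the parameter count and the average bit width. The sketch below uses the 228.7B total parameters and 3.249 bpw quoted elsewhere on this card; reading the card's "GB" as binary gigabytes (GiB) is an assumption, and the helper name is illustrative only.

```python
# Back-of-envelope size check; interpreting "GB" as GiB is an assumption.
def quantized_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given average bit width."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30

size_gib = quantized_size_gib(228.7e9, 3.249)   # figures quoted on this card
reduction = 1 - 87 / 230                        # Config-I size vs FP8 source size
print(f"estimated size: {size_gib:.1f} GiB, reduction: {reduction:.1%}")
```

Under these assumptions the estimate lands close to the 87 GB and 62% reported in the table.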
## Quality
**Perplexity:** 4.604 ± 0.042 (wikitext, 50 samples, 2048 seq length, with turbo4v2 KV compression)
**MMLU (200q, single-pass, reasoning ON):**
| Subject | Score |
|---|---|
| Abstract Algebra | 18/20 |
| Anatomy | 19/20 |
| Astronomy | 19/20 |
| College CS | 18/20 |
| College Physics | 19/20 |
| HS Biology | 20/20 |
| HS Chemistry | 17/20 |
| HS Math | 20/20 |
| Logical Fallacies | 19/20 |
| World Religions | 18/20 |
| **TOTAL** | **187/200 (93.5%)** |
Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with `mlx_lm` on Apple M5 Max 128 GB.
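
The total can be re-derived from the per-subject rows, and a 200-question sample carries some statistical uncertainty. The sketch below tallies the table and adds a normal-approximation 95% interval; treating the questions as independent draws is an assumption, and the interval is not part of the original evaluation.

```python
import math

# Per-subject correct counts from the table above (20 questions each).
scores = {
    "Abstract Algebra": 18, "Anatomy": 19, "Astronomy": 19, "College CS": 18,
    "College Physics": 19, "HS Biology": 20, "HS Chemistry": 17, "HS Math": 20,
    "Logical Fallacies": 19, "World Religions": 18,
}
questions_per_subject = 20

total_correct = sum(scores.values())
total_questions = questions_per_subject * len(scores)
accuracy = total_correct / total_questions

# Normal-approximation standard error for a binomial proportion
# (assumes independent questions; an illustrative simplification).
std_err = math.sqrt(accuracy * (1 - accuracy) / total_questions)
ci95 = 1.96 * std_err
print(f"{total_correct}/{total_questions} = {accuracy:.1%} (+/- {ci95:.1%} at 95%)")
```

With this sample size the interval is a few percentage points wide, which is worth keeping in mind when comparing against scores from the full MMLU set.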
**NIAH (Needle in a Haystack):** 12/12 (100%)
| Context | 10% depth | 50% depth | 90% depth |
|---------|-----------|-----------|-----------|
| 1.4K | ✓ | ✓ | ✓ |
| 2.4K | ✓ | ✓ | ✓ |
| 4.4K | ✓ | ✓ | ✓ |
| 8.3K | ✓ | ✓ | ✓ |
## Speed (Apple M5 Max 128 GB)
All benchmarks with turbo4v2 KV compression enabled. Measured with [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning) (`ek/tom-eric-moe-tuning` branch).
### Prefill
The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead, giving 5-48% faster prompt processing. In the table below, the largest gain is at 16K tokens (+48%), followed by 512-1024 token prompts (+26% to +32%).
| Context (tokens) | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift + turbo4v2 |
|---------|-------------------|------------------|---------------|------------------------|
| 128 | 199 t/s | 185 t/s | 185 t/s | +8% |
| 256 | 281 t/s | 267 t/s | 267 t/s | +5% |
| 512 | 368 t/s | 293 t/s | 293 t/s | +26% |
| 1024 | 462 t/s | 351 t/s | 351 t/s | +32% |
| 2048 | 510 t/s | 430 t/s | 430 t/s | +19% |
| 4096 | 514 t/s | 468 t/s | 468 t/s | +10% |
| 8192 | 477 t/s | 436 t/s | 436 t/s | +9% |
| 16384 | 396 t/s | 267 t/s | 267 t/s | +48% |
Note: turbo4v2 adds zero prefill overhead. Swift turbo4v2 and Swift vanilla prefill are identical.
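
The percentage column can be recomputed from the throughput columns, and throughput can be turned into an idealized prompt-processing time. The sketch below does both using the table's numbers; the function names are illustrative, and the timing ignores any fixed startup cost.

```python
# (bridge_tps, swift_tps) prefill throughput from the table above, keyed by prompt length.
prefill = {
    128: (199, 185), 256: (281, 267), 512: (368, 293), 1024: (462, 351),
    2048: (510, 430), 4096: (514, 468), 8192: (477, 436), 16384: (396, 267),
}

def speedup_pct(bridge_tps: float, swift_tps: float) -> int:
    """Relative throughput gain of Bridge over Swift, rounded to a whole percent."""
    return round((bridge_tps / swift_tps - 1) * 100)

def prefill_seconds(prompt_tokens: int, tps: float) -> float:
    """Idealized prompt-processing time: tokens divided by throughput."""
    return prompt_tokens / tps

speedups = {n: speedup_pct(b, s) for n, (b, s) in prefill.items()}
for n, (b, s) in prefill.items():
    print(f"{n:>6} tokens: +{speedups[n]}%  "
          f"{prefill_seconds(n, b):.1f}s (Bridge) vs {prefill_seconds(n, s):.1f}s (Swift)")
```

At 16K tokens this works out to roughly 41 seconds of prefill on the Bridge path versus about 61 seconds on the Swift path.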
### Decode
| Context | Bridge + turbo4v2 | Swift + turbo4v2 |
|---------|-------------------|------------------|
| 128 | 59.2 t/s | 61.1 t/s |
| 256 | 58.7 t/s | 60.5 t/s |
| 512 | 56.6 t/s | 58.5 t/s |
| 1024 | 54.7 t/s | 57.4 t/s |
| 2048 | 53.4 t/s | 50.0 t/s |
| 4096 | 50.0 t/s | 52.1 t/s |
| 8192 | 44.4 t/s | 45.4 t/s |
| 16384 | 37.3 t/s | 36.9 t/s |
Decode speed is comparable between Bridge and Swift: both paths reach roughly 59-61 tok/s at short context and degrade gracefully to ~37 tok/s at 16K.
## TurboQuant KV Cache Compression
With Config-I, the model weights are only 87 GB, leaving ~36 GB free on a 128 GB Mac. At that point, **KV cache is the bottleneck, not the model.** A 32K conversation in bf16 eats 7.9 GB of that headroom. turbo4v2 compresses that to 1.5 GB (5.3x, 81% saved), turning the remaining memory into usable context instead of KV overhead. This is where stacking Config-I and turbo4v2 matters most: the smaller the model, the more context you can reclaim.
| Context | bf16 KV | turbo4v2 KV | Saved |
|---------|---------|-------------|-------|
| 32K | 7.9 GB | 1.5 GB | 6.4 GB |
| 64K | 15.8 GB | 3.0 GB | 12.8 GB |
| 128K | 31.6 GB | 6.0 GB | 25.6 GB |
| 256K | 63.2 GB | 11.9 GB | 51.3 GB |
| 512K | 126.4 GB | 23.9 GB | 102.5 GB |
**Max context on 128 GB M5 Max** (87 GB model, ~36 GB free):
- bf16 KV: **149K tokens**
- turbo4v2 KV: **595K tokens** (4x more context)
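
The bf16 limit can be reproduced from the 32K figures quoted in this section's opening paragraph. The sketch below assumes KV memory grows linearly with context length and that the full ~36 GB is available for the cache; both are simplifying assumptions, and the names are illustrative.

```python
# Figures from this section: a 32K-token conversation costs 7.9 GB in bf16
# and 1.5 GB with turbo4v2. Linear scaling with context is an assumption.
TOKENS_32K = 32 * 1024
BF16_GB_PER_32K = 7.9
TURBO_GB_PER_32K = 1.5
FREE_GB = 36  # headroom left after the 87 GB of weights

def max_context_tokens(free_gb: float, gb_per_32k: float) -> int:
    """Largest context whose KV cache fits in free_gb, assuming linear growth."""
    return int(free_gb / gb_per_32k * TOKENS_32K)

compression_ratio = BF16_GB_PER_32K / TURBO_GB_PER_32K
saved_fraction = 1 - TURBO_GB_PER_32K / BF16_GB_PER_32K
bf16_max = max_context_tokens(FREE_GB, BF16_GB_PER_32K)

print(f"compression: {compression_ratio:.1f}x, saved: {saved_fraction:.0%}")
print(f"bf16 max context: ~{bf16_max // 1000}K tokens")
```

The turbo4v2 figure in the list corresponds to the stated 4x multiplier on the bf16 result; this sketch does not attempt to reproduce it.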
The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). The PPL of 4.604 was measured with Config-I weights and turbo4v2 KV compression stacked, so the reported figure already includes both.
## Config-I Policy (MiniMax M2.7 Adaptation)
| Component | Bits | Layers | Rationale |
|-----------|------|--------|-----------|
| Expert MLP gate/up | **2-bit** | middle 58 | 98%+ of params, MoE-tolerant |
| Expert MLP down | **3-bit** | middle 58 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | **4-bit** | middle 58 | Uniform per layer |
| Boundary (all tensors) | **8-bit** | first 2 + last 2 | Boundary layer protection |
| MoE router | **f16** | all | Routing precision critical |
| Embeddings + lm_head | **8-bit** | n/a | Protected |
Uniform MLX quantization produces broken output (~25% MMLU, random guessing) on MiniMax at all bit levels because it compresses attention and routing to the same bits as expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the 98% of parameters that tolerate it.
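
The policy table can be read as a mapping from tensor path to bit width. The sketch below expresses it as a plain Python function; the path fragments (`router`, `experts`, `down_proj`, `self_attn`, `embed_tokens`, `lm_head`) are hypothetical names chosen for illustration, and this is not the script used to produce the checkpoint. A function of this shape could feed a per-layer quantization hook, but that wiring is not shown.

```python
import re

FIRST_BOUNDARY = 2   # first 2 layers kept at 8-bit
LAST_BOUNDARY = 2    # last 2 layers kept at 8-bit
NUM_LAYERS = 62      # 58 middle layers + 4 boundary layers, per the table above

def config_i_bits(path: str):
    """Return the bit width for a tensor path, or None to leave it unquantized.

    The path patterns are hypothetical; a real checkpoint may name its
    modules differently.
    """
    if "router" in path:
        return None                      # MoE router stays in f16
    if "embed_tokens" in path or "lm_head" in path:
        return 8                         # embeddings and lm_head are protected
    match = re.search(r"layers\.(\d+)\.", path)
    if match is None:
        return None                      # unknown tensors are left untouched
    layer = int(match.group(1))
    if layer < FIRST_BOUNDARY or layer >= NUM_LAYERS - LAST_BOUNDARY:
        return 8                         # boundary layers: all tensors at 8-bit
    if "experts" in path:
        return 3 if "down_proj" in path else 2   # down is write-back sensitive
    if "self_attn" in path:
        return 4                         # attention Q/K/V/O in middle layers
    return None                          # e.g. norms, which are not quantized

for p in ("model.layers.10.mlp.experts.gate_proj",
          "model.layers.10.mlp.experts.down_proj",
          "model.layers.10.self_attn.q_proj",
          "model.layers.0.mlp.experts.gate_proj"):
    print(p, "->", config_i_bits(p))
```

Checking the router first keeps routing weights at full precision even inside the boundary layers, matching the "all" entry in the table.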
## Compatibility
| Field | Value |
|-------|-------|
| Format | MLX safetensors (standard) |
| Avg bits | 3.249 bpw |
| Runtime | `mlx_lm` (Python), `mlx-swift-lm` (Swift) |
| Platform | Apple Silicon (M-series Pro/Max/Ultra with 96 GB+ recommended) |
| Quantized on | 2026-04-12 |
**No custom loader needed.** This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with `config.json` quantization metadata will work.
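
To see how mixed precision shows up in the metadata, a tool can tally the per-layer entries in `config.json`. The sketch below assumes the `quantization` block holds global `bits`/`group_size` defaults plus optional per-layer entries keyed by module path (a settings dict, or `false` for unquantized layers); that layout, and the module paths in the synthetic example, are assumptions for illustration.

```python
import json
from collections import Counter

def bits_histogram(config: dict) -> Counter:
    """Count per-layer overrides by bit width in an MLX-style quantization block.

    The assumed layout (global defaults plus per-path entries) is illustrative
    and may not match every checkpoint.
    """
    quant = config.get("quantization", {})
    counts = Counter()
    for key, value in quant.items():
        if key in ("bits", "group_size", "mode"):
            continue                      # global defaults, not a layer entry
        if value is False:
            counts["unquantized"] += 1
        elif isinstance(value, dict):
            counts[value.get("bits", quant.get("bits"))] += 1
    return counts

# Tiny synthetic example; a real file would be read with json.load().
example = json.loads("""{
  "quantization": {
    "group_size": 64, "bits": 4,
    "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
    "model.layers.5.mlp.experts.gate_proj": {"group_size": 64, "bits": 2},
    "model.layers.5.mlp.experts.down_proj": {"group_size": 64, "bits": 3},
    "model.layers.5.mlp.router": false
  }
}""")
hist = bits_histogram(example)
print(dict(hist))
```

Running the same function over the repository's real `config.json` would show how many modules sit at each bit width, provided the file follows the assumed layout.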
## How to Run
### Python (mlx_lm)
```bash
pip install mlx-lm
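# Sample at temperature 1.0; greedy decoding can loop on this model (see the note further down).
python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello" --temp 1.0 --top-p 0.95 --max-tokens 256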
```
```python
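from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")
# Recent mlx_lm releases take sampling settings through a sampler object rather than temp/top_p keyword arguments.
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))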
```
### Swift (mlx-swift-lm) - TurboQuant KV compression
> **Note:** Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work.
For the speed and KV compression results above, use [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning).
**In code:**
```swift
import MLXLLM
// Load the quantized weights by Hugging Face repo ID.
let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX"))
// tokenArray is assumed to hold the prompt's token IDs from the model's tokenizer.
let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))
```
**As an OpenAI-compatible server:**
```bash
git clone https://github.com/ekryski/mlx-swift-lm.git
cd mlx-swift-lm
git checkout ek/tom-eric-moe-tuning
swift build -c release
# Download the model
hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX
# Run server
.build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080
# Test
curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \
-d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}'
```
> **Important:** MiniMax M2.7 is an always-reasoning model. Use `temperature=1.0`; greedy decoding (temp=0) causes infinite thinking loops.
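
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above and rejects non-positive temperatures in line with the note; the function names are illustrative, and the response parsing assumes the server returns the usual OpenAI chat-completions JSON shape.

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8080/v1",
                       max_tokens: int = 256, temperature: float = 1.0):
    """Build a POST request mirroring the curl command above."""
    if temperature <= 0:
        # The card warns that greedy decoding can loop forever on this model.
        raise ValueError("use temperature > 0 (the card recommends 1.0)")
    payload = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response body with choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

request = build_chat_request("Hello")
print(request.full_url)
```

With the server running, `chat("Hello")` would return the generated text; building the request separately makes the payload easy to inspect without a live server.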
### Hermes AI Agent
With the MLXServer running on port 8080, add this to `~/.hermes/config.yaml`:
```yaml
model:
  default: local
  provider: custom
  base_url: http://localhost:8080/v1
  context_length: 196608
```
Then just run `hermes`. It will use whatever model is loaded on the server.
## What is Config-I?
Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ dramatically in compression sensitivity. The key insight: **compression policy matters more than compression math**, meaning which tensors to compress, which to protect, and how aggressively.
Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by [independent third-party implementations](https://github.com/dhawalc/turboQuantDC).
For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.
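
A back-of-envelope estimate shows how the 2-bit/3-bit expert policy lands near the reported average. The sketch below assumes MLX-style affine group quantization with group size 64 and one fp16 scale plus one fp16 bias per group, and assumes gate, up, and down projections hold equal numbers of weights; both are assumptions, and attention, embeddings, and the router are ignored.

```python
# Assumption: group size 64 with one fp16 scale and one fp16 bias per group,
# i.e. 32 extra bits stored per 64 weights.
GROUP_OVERHEAD_BPW = 32 / 64

def effective_bpw(bits: int) -> float:
    """Stored bits per weight including the assumed per-group scale and bias."""
    return bits + GROUP_OVERHEAD_BPW

# Assumption: gate, up, and down projections hold equal numbers of weights.
middle_expert_bpw = (effective_bpw(2) + effective_bpw(2) + effective_bpw(3)) / 3
boundary_expert_bpw = effective_bpw(8)

middle_layers, boundary_layers = 58, 4   # from the policy table
expert_bpw = (
    middle_layers * middle_expert_bpw + boundary_layers * boundary_expert_bpw
) / (middle_layers + boundary_layers)

active_fraction = 8 / 256                # experts used per token
print(f"expert-only average: {expert_bpw:.2f} bpw")
print(f"experts active per token: {active_fraction:.1%}")
```

Under these assumptions the expert-only average comes out near 3.2 bpw, in the same range as the 3.249 bpw reported on this card, with the higher-precision attention, embedding, and router tensors making up the difference.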
- [Config-I Paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md)
- [Getting Started Guide](https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md)
- [TurboQuant+ Repository](https://github.com/TheTom/turboquant_plus)
---
Quantized by [@thetom-ai](https://huggingface.co/thetom-ai) | [GitHub](https://github.com/TheTom) | [X](https://x.com/no_stp_on_snek) | [Sponsor](https://github.com/sponsors/TheTom)