| base_model: MiniMaxAI/MiniMax-M2.7 | |
| language: | |
| - en | |
| license: other | |
| license_name: minimax-m2.7-non-commercial | |
| license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE | |
| pipeline_tag: text-generation | |
| tags: | |
| - mlx | |
| - turboquant | |
| - turboquant-plus | |
| - config-i | |
| - moe | |
| - apple-silicon | |
| quantized_by: thetom-ai | |
| inference: false | |
| # MiniMax-M2.7 - TurboQuant+ Config-I (MLX) | |
| **93.5% MMLU at 87 GB. Up to 61 tok/s decode. PPL 4.604.** 228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format: works with stock `mlx_lm` and `mlx-swift-lm`. No custom loaders required. | |
| Config-I quantization of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) (228.7B total, ~1.4B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers and routing at full precision. See the [Config-I paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md) for the policy derivation. | |
| ## Compression | |
| | | Size | | |
| |---|---| | |
| | FP8 source | 230 GB | | |
| | **Config-I (3.25 bpw)** | **87 GB** | | |
| | **Reduction** | **62%** | | |
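As a rough cross-check of the table, the sketch below recomputes the reduction and estimates the weight size from the average bit width. It assumes the 228.7B total parameter count and the 3.249 bpw average quoted elsewhere in this card, and it ignores any per-file overhead. The card does not say whether its sizes are decimal GB or GiB, so both are printed.

```python
# Back-of-the-envelope size check (a sketch, not the author's tooling).
TOTAL_PARAMS = 228.7e9   # total parameters, as quoted in this card
AVG_BITS = 3.249         # average bits per weight, as quoted in this card
SOURCE_GB = 230          # FP8 source size from the table
QUANT_GB = 87            # Config-I size from the table

reduction = 1 - QUANT_GB / SOURCE_GB
size_bytes = TOTAL_PARAMS * AVG_BITS / 8
size_gb = size_bytes / 1e9        # decimal gigabytes
size_gib = size_bytes / 2**30     # binary gibibytes

print(f"reduction: {reduction:.1%}")
print(f"estimated weights: {size_gb:.1f} GB ({size_gib:.1f} GiB)")
```

The estimate lands close to the quoted 87 GB when read in GiB, which suggests (but does not confirm) that the table uses binary units.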
| ## Quality | |
| **Perplexity:** 4.604 ± 0.042 (wikitext, 50 samples, 2048 seq length, with turbo4v2 KV compression) | |
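For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below applies that definition to made-up log-probabilities. It is not the evaluation script behind the number above, and `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_logprobs):
    """Return exp of the mean negative log-likelihood (natural log) over tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Made-up example: a model that assigns probability 0.25 to every token.
logprobs = [math.log(0.25)] * 8
ppl = perplexity(logprobs)
print(f"perplexity: {ppl:.3f}")
```

A model that is uniformly unsure among four choices scores a perplexity of about 4, which gives a feel for the scale of the reported value.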
| **MMLU (200q, single-pass, reasoning ON):** | |
| | Subject | Score | | |
| |---|---| | |
| | Abstract Algebra | 18/20 | | |
| | Anatomy | 19/20 | | |
| | Astronomy | 19/20 | | |
| | College CS | 18/20 | | |
| | College Physics | 19/20 | | |
| | HS Biology | 20/20 | | |
| | HS Chemistry | 17/20 | | |
| | HS Math | 20/20 | | |
| | Logical Fallacies | 19/20 | | |
| | World Religions | 18/20 | | |
| | **TOTAL** | **187/200 (93.5%)** | | |
| Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with `mlx_lm` on Apple M5 Max 128 GB. | |
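The total can be re-derived from the per-subject rows. The sketch below sums them and, as an added assumption not made in the card, treats the 200 questions as independent trials to estimate a binomial standard error for the accuracy.

```python
import math

# Correct answers out of 20 per subject, copied from the table above.
scores = {
    "Abstract Algebra": 18,
    "Anatomy": 19,
    "Astronomy": 19,
    "College CS": 18,
    "College Physics": 19,
    "HS Biology": 20,
    "HS Chemistry": 17,
    "HS Math": 20,
    "Logical Fallacies": 19,
    "World Religions": 18,
}
QUESTIONS_PER_SUBJECT = 20

correct = sum(scores.values())
total = QUESTIONS_PER_SUBJECT * len(scores)
accuracy = correct / total
# Binomial standard error; assumes independent, identically hard questions.
std_err = math.sqrt(accuracy * (1 - accuracy) / total)

print(f"{correct}/{total} = {accuracy:.1%} (standard error about {std_err:.1%})")
```

With only 200 questions the standard error is a little under two percentage points, so small differences between quantization variants on this subset should be read with care.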
| **NIAH (Needle in a Haystack):** 12/12 (100%) | |
| | Context | 10% depth | 50% depth | 90% depth | | |
| |---------|-----------|-----------|-----------| | |
| | 1.4K | ✓ | ✓ | ✓ | | |
| | 2.4K | ✓ | ✓ | ✓ | | |
| | 4.4K | ✓ | ✓ | ✓ | | |
| | 8.3K | ✓ | ✓ | ✓ | | |
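The card does not describe its NIAH harness, so the following is only a generic sketch of how such a test is usually built: a "needle" sentence is inserted at a relative depth inside filler text, and the model is then asked to retrieve it. The helper name `build_niah_prompt`, the filler text, and the needle are all made up for illustration.

```python
def build_niah_prompt(filler, needle, depth):
    """Insert `needle` into a list of filler sentences at a relative depth.

    depth=0.1 places the needle about 10% of the way into the context.
    Returns the joined prompt and the sentence index where the needle sits.
    """
    position = round(len(filler) * depth)
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences), position

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The secret code is 4812."

for depth in (0.1, 0.5, 0.9):
    prompt, position = build_niah_prompt(filler, needle, depth)
    print(f"depth {depth:.0%}: needle inserted before filler sentence {position}")
```

A real run would scale the filler to the target context length (1.4K to 8.3K tokens in the table) and score whether the model's answer contains the needle.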
| ## Speed (Apple M5 Max 128 GB) | |
| All benchmarks with turbo4v2 KV compression enabled. Measured with [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning) (`ek/tom-eric-moe-tuning` branch). | |
| ### Prefill | |
| The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead for 5-48% faster prompt processing. The gain is largest at 16384 tokens (+48%) and at 512-1024 tokens (+26% to +32%). | |
| | Context | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift turbo4v2 | |
| |---------|-------------------|------------------|---------------|------------------------| | |
| | 128 | 199 t/s | 185 t/s | 185 t/s | +8% | | |
| | 256 | 281 t/s | 267 t/s | 267 t/s | +5% | | |
| | 512 | 368 t/s | 293 t/s | 293 t/s | +26% | | |
| | 1024 | 462 t/s | 351 t/s | 351 t/s | +32% | | |
| | 2048 | 510 t/s | 430 t/s | 430 t/s | +19% | | |
| | 4096 | 514 t/s | 468 t/s | 468 t/s | +10% | | |
| | 8192 | 477 t/s | 436 t/s | 436 t/s | +9% | | |
| | 16384 | 396 t/s | 267 t/s | 267 t/s | +48% | | |
| Note: turbo4v2 adds zero prefill overhead. Swift turbo4v2 and Swift vanilla prefill throughput are identical at every context length. | |
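The speedup column can be reproduced from the two throughput columns. The sketch below does that with the table's numbers; the dictionary layout is just one convenient way to hold them.

```python
# Prefill throughput in tokens/s as (Bridge + turbo4v2, Swift + turbo4v2),
# copied from the table above and keyed by context length.
prefill = {
    128: (199, 185),
    256: (281, 267),
    512: (368, 293),
    1024: (462, 351),
    2048: (510, 430),
    4096: (514, 468),
    8192: (477, 436),
    16384: (396, 267),
}

speedups = {
    ctx: round((bridge / swift - 1) * 100)
    for ctx, (bridge, swift) in prefill.items()
}

for ctx, pct in speedups.items():
    print(f"{ctx:>6} tokens: +{pct}%")

best_ctx = max(speedups, key=speedups.get)
print(f"largest gain at {best_ctx} tokens")
```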
| ### Decode | |
| | Context | Bridge + turbo4v2 | Swift + turbo4v2 | | |
| |---------|-------------------|------------------| | |
| | 128 | 59.2 t/s | 61.1 t/s | | |
| | 256 | 58.7 t/s | 60.5 t/s | | |
| | 512 | 56.6 t/s | 58.5 t/s | | |
| | 1024 | 54.7 t/s | 57.4 t/s | | |
| | 2048 | 53.4 t/s | 50.0 t/s | | |
| | 4096 | 50.0 t/s | 52.1 t/s | | |
| | 8192 | 44.4 t/s | 45.4 t/s | | |
| | 16384 | 37.3 t/s | 36.9 t/s | | |
| Decode is comparable between Bridge and Swift: both paths run at roughly 59-61 tok/s at short context and degrade gracefully to ~37 tok/s at 16K. | |
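To turn the prefill and decode rates into a feel for end-to-end latency, the sketch below estimates the time for one request. It assumes constant rates taken from the 4096-token rows of the Bridge + turbo4v2 columns, whereas real decode speed keeps dropping as the context grows, so treat the result as a rough lower bound. `estimate_latency` is a hypothetical helper.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_s, decode_s, total_s) under a constant-rate assumption."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s

# Bridge + turbo4v2 figures at 4096 tokens of context, from the tables above.
prefill_s, decode_s, total_s = estimate_latency(
    prompt_tokens=4096, output_tokens=256, prefill_tps=514, decode_tps=50.0
)
print(f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s = {total_s:.1f}s")
```

Because the model always reasons before answering, the number of output tokens (and therefore decode time) is usually the dominant term in practice.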
| ## TurboQuant KV Cache Compression | |
| With Config-I, the model weights are only 87 GB, leaving ~36 GB free on a 128 GB Mac. At that point, **KV cache is the bottleneck, not the model.** A 32K conversation in bf16 consumes 7.9 GB of that headroom; turbo4v2 compresses it to 1.5 GB (5.3x smaller, 81% saved), turning the remaining memory into usable context instead of KV overhead. This is where stacking Config-I with turbo4v2 matters most: the smaller the model, the more memory is left for context. | |
| | Context | bf16 KV | turbo4v2 KV | Saved | | |
| |---------|---------|-------------|-------| | |
| | 32K | 7.9 GB | 1.5 GB | 6.4 GB | |
| | 64K | 15.8 GB | 3.0 GB | 12.8 GB | |
| | 128K | 31.6 GB | 6.0 GB | 25.6 GB | |
| | 256K | 63.2 GB | 11.9 GB | 51.3 GB | |
| | 512K | 126.4 GB | 23.9 GB | 102.5 GB | |
| **Max context on 128 GB M5 Max** (87 GB model, ~36 GB free): | |
| - bf16 KV: **149K tokens** | |
| - turbo4v2 KV: **595K tokens** (4x more context) | |
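The headline KV figures follow from simple proportional arithmetic. The sketch below uses the 32K numbers quoted in the text above (7.9 GB in bf16, 1.5 GB with turbo4v2) and assumes KV memory grows linearly with context length and that GB means 10^9 bytes; neither assumption is stated in the card.

```python
TOKENS_32K = 32 * 1024
BF16_GB_PER_32K = 7.9     # bf16 KV cache for a 32K conversation, per the text
TURBO_GB_PER_32K = 1.5    # turbo4v2 KV cache for the same conversation
FREE_GB = 36              # approximate headroom left next to the 87 GB model

compression = BF16_GB_PER_32K / TURBO_GB_PER_32K
saved = 1 - TURBO_GB_PER_32K / BF16_GB_PER_32K
bf16_kb_per_token = BF16_GB_PER_32K * 1e9 / TOKENS_32K / 1e3
bf16_max_tokens = int(FREE_GB / BF16_GB_PER_32K * TOKENS_32K)

print(f"compression: {compression:.1f}x, saved: {saved:.0%}")
print(f"bf16 KV per token: {bf16_kb_per_token:.0f} KB")
print(f"bf16 max context in {FREE_GB} GB: about {bf16_max_tokens // 1000}K tokens")
```

This reproduces the 5.3x, 81%, and 149K figures; the turbo4v2 maximum is left out because the card does not state how it was derived.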
| The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). PPL of 4.604 measured with everything stacked, with no additional quality penalty. | |
| ## Config-I Policy (MiniMax M2.7 Adaptation) | |
| | Component | Bits | Layers | Rationale | | |
| |-----------|------|--------|-----------| | |
| | Expert MLP gate/up | **2-bit** | middle 58 | 98%+ of params, MoE-tolerant | | |
| | Expert MLP down | **3-bit** | middle 58 | Write-back sensitivity (Config-I finding) | | |
| | Attention Q/K/V/O | **4-bit** | middle 58 | Uniform per layer | | |
| | Boundary (all tensors) | **8-bit** | first 2 + last 2 | Boundary layer protection | | |
| | MoE router | **f16** | all | Routing precision critical | | |
| | Embeddings + lm_head | **8-bit** | - | Protected | |
| Uniform MLX quantization produces broken output (~25% MMLU, random guessing) on MiniMax at all bit levels because it compresses attention and routing to the same bits as expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the 98% of parameters that tolerate it. | |
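The policy table can be read as a function from a tensor's role and layer index to a bit width. The sketch below encodes it that way. The tensor path names (`experts`, `router`, `self_attn`, `embed_tokens`, and so on) are hypothetical stand-ins rather than the names in the actual checkpoint, the 62-layer count is inferred from "middle 58" plus the four boundary layers, and leaving unmatched tensors unquantized is an assumption. It is not the author's conversion script.

```python
import re

NUM_LAYERS = 62   # assumption: 58 middle layers + first 2 + last 2
BOUNDARY = 2      # layers protected at each end of the stack

def config_i_bits(path):
    """Return the bit width for a weight path, or None to leave it unquantized."""
    if "embed_tokens" in path or "lm_head" in path:
        return 8                      # embeddings and output head: protected
    if ".router." in path:
        return None                   # MoE router stays in f16 in every layer
    match = re.search(r"layers\.(\d+)\.", path)
    if match is None:
        return None                   # assumption: anything else is untouched
    layer = int(match.group(1))
    if layer < BOUNDARY or layer >= NUM_LAYERS - BOUNDARY:
        return 8                      # boundary layers: all tensors at 8-bit
    if ".experts." in path:
        return 3 if "down_proj" in path else 2   # down: 3-bit, gate/up: 2-bit
    if ".self_attn." in path:
        return 4                      # attention Q/K/V/O
    return None

for example in (
    "model.layers.10.mlp.experts.gate_proj.weight",
    "model.layers.10.mlp.experts.down_proj.weight",
    "model.layers.10.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.10.mlp.router.weight",
):
    print(example, "->", config_i_bits(example))
```

The router check comes before the boundary check so that routing weights stay in f16 even inside the first and last two layers, matching the "all" entry in the table.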
| ## Compatibility | |
| | Field | Value | | |
| |-------|-------| | |
| | Format | MLX safetensors (standard) | | |
| | Avg bits | 3.249 bpw | | |
| | Runtime | `mlx_lm` (Python), `mlx-swift-lm` (Swift) | | |
| | Platform | Apple Silicon (M-series Pro/Max/Ultra with 96 GB+ unified memory recommended) | |
| | Quantized on | 2026-04-12 | | |
| **No custom loader needed.** This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with `config.json` quantization metadata will work. | |
| ## How to Run | |
| ### Python (mlx_lm) | |
| ```bash | |
| pip install mlx-lm | |
| python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello" --temp 1.0 | |
| ``` | |
| ```python | |
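| from mlx_lm import load, generate | |
| from mlx_lm.sample_utils import make_sampler | |
| model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX") | |
| # Recent mlx_lm releases take sampling settings via a sampler object, not temp/top_p keyword arguments | |
| sampler = make_sampler(temp=1.0, top_p=0.95) | |
| print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler)) | |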
| ``` | |
| ### Swift (mlx-swift-lm) - TurboQuant KV compression | |
| > **Note:** Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work. | |
| For the speed and KV compression results above, use [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning). | |
| **In code:** | |
| ```swift | |
| import MLXLLM | |
| let container = try await LLMModelFactory.shared.loadContainer( | |
|     configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX")) | |
| // tokenArray is assumed to hold the prompt's token IDs, prepared elsewhere with the model's tokenizer | |
| let result = try await container.generate( | |
|     input: .init(text: .init(tokens: tokenArray)), | |
|     parameters: GenerateParameters(temperature: 1.0)) | |
| ``` | |
| **As an OpenAI-compatible server:** | |
| ```bash | |
| git clone https://github.com/ekryski/mlx-swift-lm.git | |
| cd mlx-swift-lm | |
| git checkout ek/tom-eric-moe-tuning | |
| swift build -c release | |
| # Download the model | |
| hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX | |
| # Run server | |
| .build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080 | |
| # Test | |
| curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \ | |
| -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}' | |
| ``` | |
| > **Important:** MiniMax M2.7 is an always-reasoning model. Use `temperature=1.0`; greedy decoding (temp=0) causes infinite thinking loops. | |
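The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl example: the URL, the `"local"` model name, and `temperature` of 1.0 come from that example, while the helper names and the assumption that the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`) are mine.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080/v1", max_tokens=256):
    """Build a POST request for the server's chat completions endpoint."""
    payload = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,  # avoid temp=0, per the note above
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Nothing is sent until `chat` is called.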
| ### Hermes AI Agent | |
| With the MLXServer running on port 8080, add this to `~/.hermes/config.yaml`: | |
| ```yaml | |
| model: | |
|   default: local | |
|   provider: custom | |
|   base_url: http://localhost:8080/v1 | |
|   context_length: 196608 | |
| ``` | |
| Then just run `hermes`. It will use whatever model is loaded on the server. | |
| ## What is Config-I? | |
| Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ dramatically in compression sensitivity. The key insight: **compression policy matters more than compression math**. What counts is which tensors to compress, which to protect, and how aggressively. | |
| Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by [independent third-party implementations](https://github.com/dhawalc/turboQuantDC). | |
| For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision. | |
| - [Config-I Paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md) | |
| - [Getting Started Guide](https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md) | |
| - [TurboQuant+ Repository](https://github.com/TheTom/turboquant_plus) | |
| --- | |
| Quantized by [@thetom-ai](https://huggingface.co/thetom-ai) | [GitHub](https://github.com/TheTom) | [X](https://x.com/no_stp_on_snek) | [Sponsor](https://github.com/sponsors/TheTom) | |