
Libraries

How to use thetom-ai/MiniMax-M2.7-ConfigI-MLX with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

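# Generate text with mlx-lm
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

# Sample at temperature 1.0: the model card warns that greedy decoding
# (the default) can leave this model stuck in its reasoning phase
sampler = make_sampler(temp=1.0, top_p=0.95)
text = generate(model, tokenizer, prompt=prompt, sampler=sampler, verbose=True)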

Local Apps

Pi

How to use thetom-ai/MiniMax-M2.7-ConfigI-MLX with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/MiniMax-M2.7-ConfigI-MLX"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "thetom-ai/MiniMax-M2.7-ConfigI-MLX"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use thetom-ai/MiniMax-M2.7-ConfigI-MLX with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "thetom-ai/MiniMax-M2.7-ConfigI-MLX"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default thetom-ai/MiniMax-M2.7-ConfigI-MLX

Run Hermes

hermes

MLX LM

How to use thetom-ai/MiniMax-M2.7-ConfigI-MLX with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "thetom-ai/MiniMax-M2.7-ConfigI-MLX"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "thetom-ai/MiniMax-M2.7-ConfigI-MLX"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "thetom-ai/MiniMax-M2.7-ConfigI-MLX",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'


	---
	base_model: MiniMaxAI/MiniMax-M2.7
	language:
	- en
	license: other
	license_name: minimax-m2.7-non-commercial
	license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
	pipeline_tag: text-generation
	tags:
	- mlx
	- turboquant
	- turboquant-plus
	- config-i
	- moe
	- apple-silicon
	quantized_by: thetom-ai
	inference: false
	---

	# MiniMax-M2.7: TurboQuant+ Config-I (MLX)

	93.5% on a 200-question MMLU subset at 87 GB. 61 tok/s peak decode. PPL 4.604. A 228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format: works with stock `mlx_lm` and `mlx-swift-lm`, with no custom loaders required.

	Config-I quantization of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) (228.7B total, ~1.4B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers and routing at full precision. See the [Config-I paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md) for the policy derivation.

	## Compression

	| Version | Size |
	|---|---|
	| FP8 source | 230 GB |
	| Config-I (3.25 bpw) | 87 GB |
	| Reduction | 62% |
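
The quantized size can be sanity-checked from the average bit width. The sketch below assumes the 228.7B parameter count quoted above and the 3.249 bpw average listed under Compatibility. It also assumes that "87 GB" is a binary-gigabyte (GiB) figure, which this README does not state.

```python
# Rough size check: parameters x average bits-per-weight -> bytes of weights.
# Reading "87 GB" as GiB is an assumption, not something the README states.

def quantized_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes for a given average bit width."""
    return n_params * bits_per_weight / 8


size = quantized_size_bytes(228.7e9, 3.249)
print(f"{size / 1e9:.1f} GB (decimal)")    # 92.9 GB (decimal)
print(f"{size / 2**30:.1f} GiB (binary)")  # 86.5 GiB (binary)
```

Under the GiB reading the estimate lands within about half a gigabyte of the reported size. File metadata and rounding would account for a gap of that order.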

	## Quality

	Perplexity: 4.604 ± 0.042 (wikitext, 50 samples, 2048 seq length, with turbo4v2 KV compression)

	MMLU (200q, single-pass, reasoning ON):

	| Subject | Score |
	|---|---|
	| Abstract Algebra | 18/20 |
	| Anatomy | 19/20 |
	| Astronomy | 19/20 |
	| College CS | 18/20 |
	| College Physics | 19/20 |
	| HS Biology | 20/20 |
	| HS Chemistry | 17/20 |
	| HS Math | 20/20 |
	| Logical Fallacies | 19/20 |
	| World Religions | 18/20 |
	| TOTAL | 187/200 (93.5%) |

	Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with `mlx_lm` on Apple M5 Max 128 GB.
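
Because the evaluation uses only 200 questions, it helps to see how much sampling noise the headline number carries. The sketch below re-derives the total from the per-subject scores in the table and adds a binomial standard error. Treating the questions as independent draws is a simplifying assumption of this example, not part of the stated methodology.

```python
import math

# Correct answers out of 20 per subject, copied from the table above.
scores = {
    "Abstract Algebra": 18,
    "Anatomy": 19,
    "Astronomy": 19,
    "College CS": 18,
    "College Physics": 19,
    "HS Biology": 20,
    "HS Chemistry": 17,
    "HS Math": 20,
    "Logical Fallacies": 19,
    "World Religions": 18,
}
QUESTIONS_PER_SUBJECT = 20

correct = sum(scores.values())
total = QUESTIONS_PER_SUBJECT * len(scores)
accuracy = correct / total

# Binomial standard error of the accuracy estimate (independence assumed).
std_err = math.sqrt(accuracy * (1 - accuracy) / total)

print(f"{correct}/{total} = {accuracy:.1%} (standard error ~{std_err:.1%})")
# 187/200 = 93.5% (standard error ~1.7%)
```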

	NIAH (Needle in a Haystack): 12/12 (100%)

	| Context | 10% depth | 50% depth | 90% depth |
	|---------|-----------|-----------|-----------|
	| 1.4K | ✓ | ✓ | ✓ |
	| 2.4K | ✓ | ✓ | ✓ |
	| 4.4K | ✓ | ✓ | ✓ |
	| 8.3K | ✓ | ✓ | ✓ |

	## Speed (Apple M5 Max 128 GB)

	All benchmarks with turbo4v2 KV compression enabled. Measured with [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning) (`ek/tom-eric-moe-tuning` branch).

	### Prefill

	The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead, giving 5-48% faster prompt processing. The largest gains appear at 512-1024 token prompts (+26% to +32%) and at 16384 tokens (+48%).

	| Context | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift turbo4v2 |
	|---------|-------------------|------------------|---------------|--------------------------|
	| 128 | 199 t/s | 185 t/s | 185 t/s | +8% |
	| 256 | 281 t/s | 267 t/s | 267 t/s | +5% |
	| 512 | 368 t/s | 293 t/s | 293 t/s | +26% |
	| 1024 | 462 t/s | 351 t/s | 351 t/s | +32% |
	| 2048 | 510 t/s | 430 t/s | 430 t/s | +19% |
	| 4096 | 514 t/s | 468 t/s | 468 t/s | +10% |
	| 8192 | 477 t/s | 436 t/s | 436 t/s | +9% |
	| 16384 | 396 t/s | 267 t/s | 267 t/s | +48% |

	Note: turbo4v2 adds no measurable prefill overhead: Swift turbo4v2 and Swift vanilla prefill throughput are identical at every context length.
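
The speedup column in the prefill table is the ratio of the Bridge and Swift throughputs. A short sketch that recomputes it from the table's own numbers (the function name is just for this example):

```python
# Prefill throughput in tokens/s, copied from the table above:
# context length -> (Bridge + turbo4v2, Swift + turbo4v2)
prefill = {
    128: (199, 185),
    256: (281, 267),
    512: (368, 293),
    1024: (462, 351),
    2048: (510, 430),
    4096: (514, 468),
    8192: (477, 436),
    16384: (396, 267),
}


def speedup_percent(bridge_tps: float, swift_tps: float) -> int:
    """Relative throughput gain of Bridge over Swift, in whole percent."""
    return round((bridge_tps / swift_tps - 1) * 100)


speedups = {ctx: speedup_percent(b, s) for ctx, (b, s) in prefill.items()}
for ctx, pct in speedups.items():
    print(f"{ctx:>6} tokens: +{pct}%")
```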

	### Decode

	| Context | Bridge + turbo4v2 | Swift + turbo4v2 |
	|---------|-------------------|------------------|
	| 128 | 59.2 t/s | 61.1 t/s |
	| 256 | 58.7 t/s | 60.5 t/s |
	| 512 | 56.6 t/s | 58.5 t/s |
	| 1024 | 54.7 t/s | 57.4 t/s |
	| 2048 | 53.4 t/s | 50.0 t/s |
	| 4096 | 50.0 t/s | 52.1 t/s |
	| 8192 | 44.4 t/s | 45.4 t/s |
	| 16384 | 37.3 t/s | 36.9 t/s |

	Decode is comparable between Bridge and Swift: both paths reach roughly 59-61 tok/s at short context and degrade gracefully to ~37 tok/s at 16K.
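
Prefill and decode rates combine into a rough end-to-end latency estimate. The sketch below uses a simple additive model, which is an assumption of this example: it ignores model load time and the gradual decode slowdown as the context grows. The rates are the Swift + turbo4v2 figures at 4096 tokens from the tables above.

```python
def estimated_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Time to process the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# 4096-token prompt, 256-token reply, Swift + turbo4v2 rates at 4096 context.
latency = estimated_latency_s(4096, 256, prefill_tps=468, decode_tps=52.1)
print(f"{latency:.1f} s")  # 13.7 s
```

Since the model always reasons before answering, the output token count in practice includes the reasoning tokens, not just the visible reply.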

	## TurboQuant KV Cache Compression

	With Config-I, the model weights take only 87 GB, leaving ~36 GB free on a 128 GB Mac. At that point, the KV cache is the bottleneck, not the model. A 32K conversation in bf16 eats 7.9 GB of that headroom. turbo4v2 compresses that to 1.5 GB (5.3x, 81% saved), turning the remaining memory into usable context instead of KV overhead. This is where stacking Config-I and turbo4v2 matters most: the smaller the model, the more context you can reclaim.

	| Context | bf16 KV | turbo4v2 KV | Saved |
	|---------|---------|-------------|-------|
	| 8K | 7.9 GB | 1.5 GB | 6.4 GB |
	| 16K | 15.8 GB | 3.0 GB | 12.8 GB |
	| 32K | 31.6 GB | 6.0 GB | 25.6 GB |
	| 64K | 63.2 GB | 11.9 GB | 51.3 GB |
	| 128K | 126.4 GB | 23.9 GB | 102.5 GB |

	Max context on 128 GB M5 Max (87 GB model, ~36 GB free):
	- bf16 KV: 149K tokens
	- turbo4v2 KV: 595K tokens (4x more context)
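
For readers who want to estimate KV cache size themselves, a bf16 cache stores one key and one value vector per KV head, per layer, per token. The sketch below is a back-of-the-envelope calculation, not the author's measurement method. The layer count of 62 follows the 58 + 2 + 2 layers in the Config-I policy table. The 8 KV heads and head dimension of 128 are assumptions for this example and should be checked against the model's `config.json`.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # bf16
) -> int:
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Leading factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# ASSUMED attention shape (see lead-in); only the layer count comes from the README.
SHAPE = dict(n_layers=62, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)
at_32k = kv_cache_bytes(32 * 1024, **SHAPE)
max_tokens = (36 * 2**30) // per_token  # tokens that fit in 36 GiB of headroom

print(f"{per_token} bytes per token")        # 253952 bytes per token
print(f"{at_32k / 2**30:.2f} GiB at 32K")    # 7.75 GiB at 32K
print(f"{max_tokens / 1024:.0f}K tokens fit in 36 GiB")  # 149K tokens fit in 36 GiB
```

Under these assumptions the bf16 estimate lands close to the 32K and maximum-context figures quoted in the prose above.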

	The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). The PPL of 4.604 was measured with everything stacked, with no additional quality penalty.

	## Config-I Policy (MiniMax M2.7 Adaptation)

	| Component | Bits | Layers | Rationale |
	|-----------|------|--------|-----------|
	| Expert MLP gate/up | 2-bit | middle 58 | 98%+ of params, MoE-tolerant |
	| Expert MLP down | 3-bit | middle 58 | Write-back sensitivity (Config-I finding) |
	| Attention Q/K/V/O | 4-bit | middle 58 | Uniform per layer |
	| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
	| MoE router | f16 | all | Routing precision critical |
	| Embeddings + lm_head | 8-bit | - | Protected |

	Uniform MLX quantization produces broken output (~25% MMLU, i.e. random guessing) on MiniMax at all bit levels because it compresses attention and routing to the same bit width as the expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the 98% of parameters that tolerate it.
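
The policy table can be read as a lookup from tensor role and layer index to bit width. The sketch below illustrates that mapping and is not the TurboQuant+ implementation. The tensor path names (`router`, `experts`, `self_attn`, `down_proj`, and so on) are hypothetical stand-ins, and leaving unmatched tensors such as norms unquantized is an assumption of this example.

```python
import re
from typing import Optional

N_LAYERS = 62        # 58 middle + first 2 + last 2, per the policy table
BOUNDARY_LAYERS = 2  # layers protected at each end


def config_i_bits(path: str, n_layers: int = N_LAYERS) -> Optional[int]:
    """Return the bit width for a tensor path, or None to leave it in f16.

    Path names are HYPOTHETICAL; adapt the substrings to the real checkpoint.
    """
    if "router" in path:                      # routing stays f16 in every layer
        return None
    if "embed_tokens" in path or "lm_head" in path:
        return 8
    match = re.search(r"layers\.(\d+)\.", path)
    if match is None:                         # e.g. final norm: assumed unquantized
        return None
    layer = int(match.group(1))
    if layer < BOUNDARY_LAYERS or layer >= n_layers - BOUNDARY_LAYERS:
        return 8                              # boundary layers: all tensors 8-bit
    if "experts" in path:
        return 3 if "down_proj" in path else 2  # write-back gets the extra bit
    if "self_attn" in path:
        return 4
    return None


for name in [
    "model.layers.30.mlp.experts.7.gate_proj",
    "model.layers.30.mlp.experts.7.down_proj",
    "model.layers.30.self_attn.q_proj",
    "model.layers.0.mlp.experts.7.gate_proj",
    "model.layers.30.mlp.router",
]:
    print(name, "->", config_i_bits(name))
```

The router check comes first so that routing stays at full precision even inside the 8-bit boundary layers, matching the "all" entry in the table.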

	## Compatibility

	| Field | Value |
	|-------|-------|
	| Format | MLX safetensors (standard) |
	| Avg bits | 3.249 bpw |
	| Runtime | `mlx_lm` (Python), `mlx-swift-lm` (Swift) |
	| Platform | Apple Silicon (recommended M-series Pro/Max/Ultra with 96 GB+) |
	| Quantized on | 2026-04-12 |

	No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with `config.json` quantization metadata will work.
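
To see which bit widths a checkpoint declares, the quantization metadata can be summarized with a few lines of Python. This sketch assumes a layout in which the `quantization` object in `config.json` holds default `bits` and `group_size` values plus per-layer override objects keyed by layer path. The sample dictionary is hypothetical and heavily truncated, and is not copied from this repository.

```python
from collections import Counter


def summarize_quantization(config: dict) -> Counter:
    """Count per-layer quantization overrides by bit width."""
    quant = config.get("quantization", {})
    counts = Counter()
    for value in quant.values():
        # Per-layer overrides are nested objects; defaults are plain integers.
        if isinstance(value, dict) and "bits" in value:
            counts[value["bits"]] += 1
    return counts


# HYPOTHETICAL sample; a real file would list many more layers.
sample_config = {
    "quantization": {
        "group_size": 64,
        "bits": 2,
        "model.layers.0.self_attn.q_proj": {"group_size": 64, "bits": 8},
        "model.layers.30.self_attn.q_proj": {"group_size": 64, "bits": 4},
        "model.layers.30.self_attn.k_proj": {"group_size": 64, "bits": 4},
    }
}

summary = summarize_quantization(sample_config)
print(dict(summary))  # {8: 1, 4: 2}
```

For a downloaded model, the same function can be applied to the result of `json.load` on the local `config.json`.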

	## How to Run

	### Python (mlx_lm)

	```bash
	pip install mlx-lm
	python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello" --temp 1.0
	```

	```python
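	from mlx_lm import load, generate
	from mlx_lm.sample_utils import make_sampler

	model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")
	# Recent mlx-lm releases take sampling settings through a sampler object
	sampler = make_sampler(temp=1.0, top_p=0.95)
	print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))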
	```

	### Swift (mlx-swift-lm): TurboQuant KV compression

	> Note: Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work.

	For the speed and KV compression results above, use [ekryski/mlx-swift-lm](https://github.com/ekryski/mlx-swift-lm/tree/ek/tom-eric-moe-tuning).

	In code:

	```swift
	import MLXLLM

	let container = try await LLMModelFactory.shared.loadContainer(
	    configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX"))

	let result = try await container.generate(
	    input: .init(text: .init(tokens: tokenArray)),
	    parameters: GenerateParameters(temperature: 1.0))
	```

	As an OpenAI-compatible server:

	```bash
	git clone https://github.com/ekryski/mlx-swift-lm.git
	cd mlx-swift-lm
	git checkout ek/tom-eric-moe-tuning
	swift build -c release

	# Download the model
	hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX

	# Run server
	.build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080

	# Test
	curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \
	-d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}'
	```
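
The same test request can be issued from Python using only the standard library. This is a minimal sketch that assumes the server is listening on port 8080 as started above and returns the usual OpenAI chat-completions response shape (`choices[0].message.content`). The helper names are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the server command


def build_chat_request(
    prompt: str, max_tokens: int = 256, temperature: float = 1.0
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl test above."""
    payload = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,  # keep at 1.0; see the note on greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the reply text (response shape is assumed)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.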

	> Important: MiniMax M2.7 is an always-reasoning model. Use `temperature=1.0`; greedy decoding (temp=0) causes infinite thinking loops.

	### Hermes AI Agent

	With the MLXServer running on port 8080, add this to `~/.hermes/config.yaml`:

	```yaml
	model:
	  default: local
	  provider: custom
	  base_url: http://localhost:8080/v1
	  context_length: 196608
	```

	Then just run `hermes`. It will use whatever model is loaded on the server.

	## What is Config-I?

	Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers differ dramatically in compression sensitivity. The key insight: compression policy matters more than compression math. What counts is which tensors to compress, which to protect, and how aggressively.

	Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by [independent third-party implementations](https://github.com/dhawalc/turboQuantDC).

	For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.

	- [Config-I Paper](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md)
	- [Getting Started Guide](https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md)
	- [TurboQuant+ Repository](https://github.com/TheTom/turboquant_plus)

	---

	Quantized by [@thetom-ai](https://huggingface.co/thetom-ai) | [GitHub](https://github.com/TheTom) | [X](https://x.com/no_stp_on_snek) | [Sponsor](https://github.com/sponsors/TheTom)