---
language:
- en
- zh
library_name: mlx
license: mit
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- mlx
- jang
- jangtq
- minimax
- minimax_m2
- moe
- apple-silicon
- 2bit
- turboquant
---

> ## ⚠️ REQUIRED: `jangtq_runtime.safetensors` sidecar must be downloaded
>
> Osaurus uses the native Swift JANGTQ runtime. **Every JANGTQ bundle on
> OsaurusAI ships a small `jangtq_runtime.safetensors` sidecar (~10 KB–~165 KB)
> alongside the weight shards.** If the file is absent, the Swift loader
> refuses to start with the error:
> ```
> Error: Model '<name>' declares JANGTQ (weight_format: "mxtq") but is
> missing required sidecar file 'jangtq_runtime.safetensors'.
> Re-download the full model or obtain the sidecar from the original
> publisher.
> ```
>
> If your local copy doesn't have it (older download, partial sync, etc.):
> ```bash
> hf download OsaurusAI/MiniMax-M2.7-JANGTQ jangtq_runtime.safetensors --local-dir <your-dir>
> ```
> The file holds the deterministic codebooks and Hadamard rotation signs the
> Swift loader uses to decode `*.tq_packed` weights. It must match the seed
> the bundle was quantized with (`mxtq_seed=42`).

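
A small pre-flight check can catch a missing sidecar before the Swift loader does. The sketch below is illustrative only: `sidecar_present` and `ensure_sidecar` are hypothetical helper names, and the download path assumes the `huggingface_hub` package is installed.

```python
from pathlib import Path

REPO_ID = "OsaurusAI/MiniMax-M2.7-JANGTQ"
SIDECAR = "jangtq_runtime.safetensors"


def sidecar_present(model_dir):
    """Return True if the sidecar exists in model_dir and is non-empty."""
    path = Path(model_dir) / SIDECAR
    return path.is_file() and path.stat().st_size > 0


def ensure_sidecar(model_dir):
    """Hypothetical helper: fetch the sidecar only when it is missing."""
    if sidecar_present(model_dir):
        return Path(model_dir) / SIDECAR
    # Imported lazily so the presence check works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return Path(hf_hub_download(REPO_ID, SIDECAR, local_dir=model_dir))
```

Calling `ensure_sidecar("<your-dir>")` is intended to mirror the `hf download` command above; it does not verify that the sidecar matches the quantization seed.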
<p align="center">
<a href="https://osaurus.ai"><img src="./osaurus-x-banner.png" alt="Osaurus AI"></a>
</p>
<h3 align="center">MiniMax M2.7: JANGTQ (MLX)</h3>
<p align="center">TurboQuant codebook quantization of MiniMax's 228B agentic MoE: routed experts at 2-bit via Lloyd-Max codebooks + Hadamard rotation; attention / embed / shared-expert / lm_head at 8-bit affine.</p>
<p align="center">
<a href="https://osaurus.ai"><img src="https://img.shields.io/badge/Web-osaurus.ai-blue" alt="Website"></a>
<a href="https://huggingface.co/OsaurusAI"><img src="https://img.shields.io/badge/HF-OsaurusAI-yellow?logo=huggingface" alt="OsaurusAI"></a>
</p>

---

## Model Details

| Property | Value |
|---|---|
| **Base Model** | MiniMaxAI/MiniMax-M2.7 |
| **Architecture** | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| **Total Parameters** | 228.7 B |
| **Active per Token** | ~1.4 B |
| **Profile** | JANGTQ |
| **Format** | JANGTQ (codebook + Hadamard); `weight_format: mxtq` in `jang_config.json` |
| **Avg bits/param** | ~2.15 |
| **Disk** | ~57 GB |
| **Context length** | 192 K tokens |
| **Chat template** | Always-reasoning (`<think>` opened at assistant start) |

## What is JANGTQ?

**JANGTQ** (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact **codebook +
Hadamard-rotated** form at runtime, with no decompression to affine, and the
matmul path uses custom Metal kernels that read packed `uint32` weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# *rotate-input-once* math).

**Result vs uniform 2-bit affine:** smaller on disk, higher quality, runs at
~89 % of affine 2-bit speed.

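
The rotate-input-once idea can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the Metal kernel or the on-disk layout: every function name is hypothetical, the packing order (16 two-bit indices per 32-bit word, lowest bits first) is assumed, and the four centroids are placeholders close to the Lloyd-Max levels for a unit Gaussian. It relies on the sign flip followed by an orthonormal Hadamard transform being an orthogonal map, so a dot product computed between a rotated weight row and a rotated input equals the original dot product before quantization error is introduced.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]


def rotate(v, signs):
    """Apply the per-element sign flip, then the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, v)])


def pack_2bit(indices):
    """Pack 2-bit codebook indices, 16 per 32-bit word (assumed layout)."""
    words = []
    for start in range(0, len(indices), 16):
        word = 0
        for k, idx in enumerate(indices[start:start + 16]):
            word |= (idx & 0b11) << (2 * k)
        words.append(word)
    return words


def unpack_2bit(words, count):
    """Inverse of pack_2bit; returns the first `count` indices."""
    out = [(w >> (2 * k)) & 0b11 for w in words for k in range(16)]
    return out[:count]


def quantize_row(w, signs, codebook):
    """Rotate a weight row, then snap each value to its nearest centroid."""
    rotated = rotate(w, signs)
    idx = [min(range(len(codebook)), key=lambda c: abs(codebook[c] - r))
           for r in rotated]
    return pack_2bit(idx)


def packed_dot(words, codebook, x_rot):
    """Dot product against an already-rotated input, read from packed words."""
    idx = unpack_2bit(words, len(x_rot))
    return sum(codebook[i] * x for i, x in zip(idx, x_rot))


rng = random.Random(42)
n = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Placeholder centroids for unit-variance rotated weights (assumption).
codebook = [-1.51, -0.4528, 0.4528, 1.51]

words = quantize_row(w, signs, codebook)
x_rot = rotate(x, signs)  # the input is rotated once, then reused per row
exact = sum(a * b for a, b in zip(w, x))
approx = packed_dot(words, codebook, x_rot)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because the input is rotated once and shared across all weight rows, the per-row work reduces to index unpacking, a codebook lookup, and a multiply-accumulate.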
## Bit Allocation

| Component | Bits | Format |
|---|:---:|---|
| Routed expert MLP (gate / up / down) | **2** | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (`nn.QuantizedLinear`, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized `nn.Linear` |
| RMSNorms / RoPE / biases | fp16 | Unquantized |

The routed experts are 98 % of parameters and the natural compression target.
Everything else stays at 8-bit affine or fp16 so the quality-critical hot path
keeps near-full precision.

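
As a rough sanity check, the figures quoted in this card can be combined in a few lines of arithmetic. The sketch below makes two simplifying assumptions that the card does not state: the non-routed 2 % is treated as uniformly 8-bit, and per-group scales, codebooks, and fp16 tensors are ignored in the payload-only average.

```python
TOTAL_PARAMS = 228.7e9      # from the Model Details table
ROUTED_FRACTION = 0.98      # routed experts' share of parameters
AVG_BITS_REPORTED = 2.15    # from the Model Details table

# Weighted average of the two payload widths, ignoring metadata overhead.
payload_bits = ROUTED_FRACTION * 2 + (1 - ROUTED_FRACTION) * 8

# Size implied by the reported average, in decimal GB and binary GiB.
size_bytes = TOTAL_PARAMS * AVG_BITS_REPORTED / 8
size_gb = size_bytes / 1e9
size_gib = size_bytes / 2**30

print(f"payload-only average: {payload_bits:.2f} bits/param")  # 2.12
print(f"implied size: {size_gb:.1f} GB ({size_gib:.1f} GiB)")   # 61.5 GB (57.2 GiB)
```

The payload-only average of 2.12 sits just under the reported ~2.15, and the gap is plausibly the overhead ignored here. The implied size matches the ~57 GB disk figure if that figure is read in binary units.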
## Important Settings

MiniMax M2.7 is an **always-reasoning** model. The chat template
unconditionally opens `<think>` at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | **1.0** | Required; `temp=0` can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| `max_tokens` | ≥ 8192 | Give reasoning room to converge |

Strip `<think>…</think>` from the response before using the final answer.

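
One way to do the stripping is sketched below. `strip_reasoning` is a hypothetical helper, and it assumes two shapes of output: a complete `<think>…</think>` block in the text, or only a closing `</think>` because the template already opened the tag inside the prompt.

```python
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text):
    """Return only the final answer from an always-reasoning response."""
    # Case 1: complete <think>...</think> blocks are present in the text.
    cleaned = _THINK_BLOCK.sub("", text)
    # Case 2: the template opened <think> in the prompt, so the generated
    # text carries only the closing tag. Keep what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    return cleaned.strip()
```

If the reasoning never closes, the text comes back unchanged apart from whitespace trimming. That usually means generation hit `max_tokens` before the model finished thinking.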
## Usage

This model requires the `jang-tools` loader, because stock `mlx_lm.load()` does
not recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).

```bash
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

# Download (or reuse the cached copy of) the full bundle, sidecar included.
model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Recent mlx-lm releases take sampling settings through a sampler object
# rather than a `temperature` keyword on generate().
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
out = generate(model, tokenizer, prompt, max_tokens=8192,
               sampler=sampler, verbose=True)
```

### Swift: Osaurus / MLX Studio

Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo; no extra flags are
needed.

## What's In This Repo

| File | Role |
|---|---|
| `model-*.safetensors` (61 shards, ~57 GB) | Weights: 2-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 `capabilities` stamp (`reasoning=qwen3`, `tool=minimax`) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=2`) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png`, `mlx-studio-logo.png` | Branding assets |

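
To confirm that a local copy looks like the bundle described above, the config fields and shard count can be read back with the standard library. This is an illustrative sketch: `describe_bundle` is a hypothetical helper, and the key names `weight_format` and `mxtq_bits` are taken from the table on the assumption that they appear as top-level keys in `config.json`.

```python
import json
from pathlib import Path


def describe_bundle(model_dir):
    """Summarise the JANGTQ-related fields of a downloaded bundle."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return {
        "model_type": config.get("model_type"),
        "weight_format": config.get("weight_format"),  # expected: "mxtq"
        "mxtq_bits": config.get("mxtq_bits"),          # expected: 2
        "shards": len(list(Path(model_dir).glob("model-*.safetensors"))),
    }
```

A complete download would be expected to report 61 shards; a lower count suggests a partial sync.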
## Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)

```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```

`<think>` and `<tool_call>` are non-special tokens by design, so the
application layer parses them. Osaurus and the `vmlx` `CapabilityDetector` read
this block verbatim and wire the `qwen3` reasoning parser + `minimax` tool
parser automatically, so streamed responses route `reasoning_content` and
`tool_calls` into the OpenAI-compatible SSE fields instead of leaking into
`content`.

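
On the client side, those separate fields can be accumulated as they stream in. The sketch below is a simplified illustration: `collect_stream` is a hypothetical helper, and the delta dictionaries are hand-written stand-ins for the `choices[0].delta` objects of parsed SSE chunks. Real tool-call deltas arrive as fragments that need merging, which is not attempted here.

```python
def collect_stream(deltas):
    """Accumulate streamed delta dicts into separate reasoning and answer text."""
    reasoning, content, tool_calls = [], [], []
    for delta in deltas:
        reasoning.append(delta.get("reasoning_content") or "")
        content.append(delta.get("content") or "")
        tool_calls.extend(delta.get("tool_calls") or [])
    return {
        "reasoning_content": "".join(reasoning),
        "content": "".join(content),
        "tool_calls": tool_calls,
    }


# Hand-written deltas standing in for parsed SSE chunks.
example = [
    {"reasoning_content": "The user wants a greeting. "},
    {"reasoning_content": "Keep it short."},
    {"content": "Hello"},
    {"content": " there!"},
]
result = collect_stream(example)
print(result["content"])
```

With the parsers wired as described, `content` should already be free of `<think>` text, so no client-side stripping is needed on this path.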
## License

MIT; see [`LICENSE`](./LICENSE).

## Credits

Created by [Jinho Jang](https://twitter.com/jangq_ai) (`eric@jangq.ai`)

Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.