hornsan1's picture
Update README.md
d66b662 verified
|
Raw
History Blame Contribute Delete
3.04 kB
---
license: other
base_model: MiniMaxAI/MiniMax-M3
tags: [mlx, vmlx, jang, reap, awq, moe, code, multimodal, minimax-m3, apple-silicon]
pipeline_tag: text-generation
---
<p align="center"><img src="./vmlx-logo.png" alt="vMLX" width="150"></p>
<h1 align="center">MiniMax-M3-REAP22-Coder</h1>
<p align="center"><b>A JANG-quantized MiniMax-M3 — coding/agentic + multimodal — for the <a href="https://mlx.studio">vMLX</a> engine (Apple Silicon / MLX).</b></p>
> ⚠️ **Requires vMLX engine v1.5.67 or newer.**
> This is a **JANG-format** model (JANG affine-mixed + **AWQ** quantization, **REAP** expert pruning, and the
> MiniMax-M3 MSA / Lightning-Indexer runtime). It will **NOT** load with `transformers`, `vLLM`, or generic MLX
> loaders — it needs vMLX's JANG loader + the M3 runtime. **Coder support lands in vMLX ≥ 1.5.67.**
## What is a JANG model?
**JANG** is vMLX's quantization + packing format: mixed-precision **affine** quantization (per-projection bit
widths) + **AWQ** activation-aware scaling + **REAP** expert pruning, described by a `jang_config.json`. Weights
stay quantized in GPU memory and are loaded by vMLX's JANG loader. Because the format **and** the MiniMax-M3
runtime (MSA dual-cache, Lightning Indexer, partial RoPE, vision tower) are vMLX-specific, **these models run
only on vMLX ≥ 1.5.67.**
## Run it
1. Install/update **vMLX 1.5.67+** — https://mlx.studio (or `pip install -U vmlx`).
2. App: **Server → New Session →** pick/download this model **→ Start →** chat.
3. CLI: `vmlx-engine serve JANGQ-AI/MiniMax-M3-REAP22-Coder --reasoning-parser minimax_m3 --tool-call-parser minimax_m3`
## Highlights
- **Coding: HumanEval pass@1 = 100%** (81/81 on a scrambled half of HumanEval, first-sample) — pass@5 = 1.000.
- **Arithmetic/reasoning recovered** vs the base REAP quant (with reasoning enabled): ~7/7 on a 7-task probe.
- **Multimodal (vision) kept.** ~107 GB on disk.
## Build
- **Base:** MiniMaxAI/MiniMax-M3 (60 layers, MoE, MSA Lightning Indexer, GQA, partial RoPE).
- **REAP pruning:** keep **100/128** routed experts per MoE layer (22% pruned), saliency-scored.
- **JANG affine quant (group_size 64):** routed gate/up = **2-bit + AWQ pre-scaling**, down = 2-bit; shared
experts 6-bit; attention 8-bit; embeddings 6-bit; lm_head 8-bit; Lightning Indexer + norms FP16; vision 8-bit.
- **"Floor" expert recipe:** protect the proven coding experts (coding saliency) + add top math experts, so
coding stays intact while math improves.
- **Calibration:** Vera (agentic-coder) dominant + GSM8K (math reasoning).
## Attribution
- Base model: **MiniMaxAI/MiniMax-M3**
- Expert pruning: **REAP** (Cerebras, ICLR 2026, arXiv:2510.13999)
- **Vera agentic-coder calibration dataset + evaluation/testing: [@hornsman1](https://huggingface.co/hornsman1) (hornsan1 on GitHub)**
- Additional math-reasoning calibration: GSM8K
- Quantization & runtime: **JANG / vMLX**
## Credits
- Vera dataset & model testing: **@hornsby_andrew** ([hornsan1](https://github.com/hornsan1) on GitHub)