---
tags:
- moe
- minimax
- bfloat16
- sglang
- mlx
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---
|
|
| # VibeStudio/MiniMax-M2-THRIFT-55-v1 |
|
|
| **Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned** |
|
|
A lean, efficiency-first variant of MiniMax-M2 designed to reduce **latency and VRAM footprint** and increase **throughput** for local, on-prem, and edge deployments.
|
|
## TL;DR
|
|
| * **What:** ~55% expert-pruned MoE with staged pruning + knowledge distillation. |
| * **Why:** Push the efficiency frontier for compact, responsive deployments. |
| * **Now:** Ready for experimentation with solid coverage across core evals and more on the way. |
|
|
| --- |
|
|
| ## Why it’s useful |
|
|
* **Lower latency:** Fast, responsive turns for interactive apps and tools.
| * **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density. |
| * **Higher throughput:** Serve more concurrent users on the same hardware. |
| * **Deployment-friendly:** Smooth drop-in via SGLang with OpenAI-compatible API. |
| * **Adaptable:** Plays well with light fine-tuning to match domain and style. |
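As a sketch of the SGLang drop-in path: launch a server (e.g. `python -m sglang.launch_server --model-path VibeStudio/MiniMax-M2-THRIFT-55-v1 --port 30000`) and point any OpenAI-compatible client at it. The port, model name, and sampling settings below are assumptions to adapt to your deployment, not a tested recipe:

```python
def build_chat_request(prompt: str,
                       model: str = "VibeStudio/MiniMax-M2-THRIFT-55-v1") -> dict:
    """Assemble a standard OpenAI-style chat.completions payload.

    Defaults (max_tokens, temperature) are illustrative assumptions.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0.7,
    }

# To send it through SGLang's OpenAI-compatible endpoint (assumed local server):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(**build_chat_request("Hello!"))
#   print(resp.choices[0].message.content)
```

Because the API surface is OpenAI-compatible, the same payload works unchanged against other compatible runtimes.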
|
|
| ## Intended use |
|
|
| * Local/air-gapped assistants and dev tools |
* Cost-sensitive batch and real-time services
| * Edge and on-prem deployments prioritizing efficiency |
|
|
| --- |
|
|
| ## How Our Approach Works |
|
|
| > **Active research in progress** — we continue to iterate and expand ablations. |
|
|
| * **Teacher–student setup:** Start with **MiniMax-M2** as teacher and a copy as student. |
| * **Gradual expert pruning:** Remove **≈5% experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts. |
* **Distill after each prune:** Retrain the student to imitate the teacher on:
  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts**.
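The staged prune-then-distill loop might be sketched as follows. This is a toy, pure-Python illustration; the expert names, pruning fraction, loss weights, and helper functions are all assumptions, not the released training code:

```python
import math

def kl_div(p, q, eps=1e-9):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def prune_stage(importance, frac=0.05, protected=frozenset()):
    """One pruning stage: drop the lowest-importance ~frac of experts,
    skipping any expert the Leave-One-Expert-Out check flagged as
    rare-but-important (`protected`)."""
    n_drop = max(1, int(len(importance) * frac))
    ranked = sorted(importance, key=importance.get)  # least important first
    dropped = set()
    for expert in ranked:
        if len(dropped) == n_drop:
            break
        if expert not in protected:
            dropped.add(expert)
    return [e for e in importance if e not in dropped]

def distill_loss(teacher, student, w_out=1.0, w_hid=0.5, w_rtr=0.5):
    """Weighted sum of the three imitation targets: output distributions,
    hidden states, and router distributions restricted (and renormalized)
    to the surviving experts. Weights are illustrative."""
    return (w_out * kl_div(teacher["probs"], student["probs"])
            + w_hid * mse(teacher["hidden"], student["hidden"])
            + w_rtr * kl_div(teacher["router"], student["router"]))
```

Repeating `prune_stage` for ~11 stages at ~5% each, with distillation against `distill_loss` between stages, yields the ~55% cumulative reduction described above.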
|
|
| --- |
|
|
| **Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)** |
| https://github.com/latent-variable/minimax-agent-guide |