---
tags:
- moe
- minimax
- bfloat16
- sglang
- mlx
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---

# VibeStudio/MiniMax-M2-THRIFT-55-v1

**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**

A lean, efficiency-first variant of MiniMax-M2 designed to maximize **latency, throughput, and VRAM savings** for local, on-prem, and edge deployments.

## TLDR

* **What:** ~55% expert-pruned MoE built with staged pruning plus knowledge distillation.
* **Why:** Push the efficiency frontier for compact, responsive deployments.
* **Now:** Ready for experimentation, with coverage across core evals and more on the way.

---

## Why it’s useful

* **Lower latency:** Fast, responsive behavior for interactive apps and tools.
* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
* **Higher throughput:** Serve more concurrent users on the same hardware.
* **Deployment-friendly:** Smooth drop-in via SGLang with an OpenAI-compatible API.
* **Adaptable:** Plays well with light fine-tuning to match domain and style.
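
As an illustration of the SGLang drop-in path, a checkpoint like this can typically be served and then queried through the OpenAI-compatible endpoint. The flags below (tensor-parallel size, port) are illustrative and depend on your SGLang version and hardware; treat this as a sketch, not the tested launch configuration:

```shell
# Launch an OpenAI-compatible SGLang server (flags are illustrative)
python -m sglang.launch_server \
  --model-path VibeStudio/MiniMax-M2-THRIFT-55-v1 \
  --tp 2 \
  --port 30000

# Query it with a standard OpenAI-style chat completion request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "VibeStudio/MiniMax-M2-THRIFT-55-v1",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Any OpenAI-compatible client (e.g. the official `openai` Python package with `base_url` pointed at the server) works the same way.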

## Intended use

* Local/air-gapped assistants and dev tools
* Cost-sensitive batch jobs and realtime services
* Edge and on-prem deployments prioritizing efficiency

---

## How Our Approach Works

> **Active research in progress** — we continue to iterate and expand ablations.

* **Teacher–student setup:** Start with **MiniMax-M2** as the teacher and a copy of it as the student.
* **Gradual expert pruning:** Remove **≈5% of experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts.
* **Distill after each prune:** Retrain the student to imitate the teacher on:
  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts**.
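
The staged pruning loop described above can be sketched as follows. Everything here is an illustrative stand-in: the importance proxy, the LOEO deltas, the protection threshold, and the 64-expert toy size are assumptions, not the actual MiniMax-M2 values, and the distillation step that runs between stages is elided:

```python
import numpy as np

def staged_prune(importance, loeo_delta, frac_per_stage=0.05, n_stages=11,
                 loeo_threshold=0.02):
    """Toy sketch of staged expert pruning with a Leave-One-Expert-Out guard.

    importance[e] -- proxy importance per expert (e.g. routed token mass)
    loeo_delta[e] -- validation-loss increase when expert e alone is removed
    Experts whose LOEO delta exceeds the threshold are protected, even if
    their raw importance score is low. Each stage drops ~frac_per_stage of
    the ORIGINAL expert count; in the real pipeline a distillation phase
    runs between stages to recover quality.
    """
    n = len(importance)
    drop = max(1, round(n * frac_per_stage))
    kept = set(range(n))
    for _ in range(n_stages):
        protected = {e for e in kept if loeo_delta[e] > loeo_threshold}
        # prune the lowest-importance unprotected experts this stage
        candidates = sorted(kept - protected, key=lambda e: importance[e])
        kept -= set(candidates[:drop])
    return sorted(kept)

# Example: 64 experts, two rare-but-important ones flagged by the LOEO check
rng = np.random.default_rng(0)
importance = rng.random(64)
loeo = np.zeros(64)
loeo[[3, 17]] = 0.10   # high LOEO delta: never pruned despite low importance
kept = staged_prune(importance, loeo)
```

The guard is the key detail: an expert the router rarely selects can still be load-bearing for specific inputs, and the LOEO check keeps it out of the prune list regardless of its aggregate importance score.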
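
The three distillation targets can be combined into a single objective. The numpy sketch below shows the shape of such a loss; the loss weights, the `eps` smoothing, and the use of plain MSE for hidden states are assumptions for illustration. Note how the router term is restricted to, and renormalized over, the surviving experts:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-9):
    # KL(p || q), averaged over the batch; eps avoids log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def distill_loss(t_logits, s_logits, t_hidden, s_hidden,
                 t_router, s_router, surviving, weights=(1.0, 0.5, 0.5)):
    """Combined distillation objective (weights are illustrative):
    1) KL between teacher and student token distributions,
    2) MSE between teacher and student hidden states,
    3) KL between router distributions renormalized over surviving experts."""
    out_loss = kl_div(softmax(t_logits), softmax(s_logits))
    hid_loss = float(np.mean((t_hidden - s_hidden) ** 2))
    t_route = softmax(t_router[..., surviving])  # teacher probs over kept experts
    s_route = softmax(s_router[..., surviving])
    route_loss = kl_div(t_route, s_route)
    w_out, w_hid, w_route = weights
    return w_out * out_loss + w_hid * hid_loss + w_route * route_loss
```

A student that perfectly imitates the teacher drives all three terms to zero; in practice the router term is what teaches the student to redistribute tokens that the pruned experts used to handle.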
---

**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
https://github.com/latent-variable/minimax-agent-guide