---
tags:
- moe
- minimax
- bfloat16
- sglang
- mlx
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---

# VibeStudio/MiniMax-M2-THRIFT-55-v1

**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**

A lean, efficiency-first variant of MiniMax-M2 designed to cut **latency and VRAM usage** and raise **throughput** for local, on-prem, and edge deployments.

## TL;DR

* **What:** ~55% expert-pruned MoE, built with staged pruning + knowledge distillation.
* **Why:** Push the efficiency frontier for compact, responsive deployments.
* **Now:** Ready for experimentation, with solid coverage across core evals and more on the way.

---

## Why it’s useful

* **Lower latency:** Fast, responsive interactions for interactive apps and tools.
* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
* **Higher throughput:** Serve more concurrent users on the same hardware.
* **Deployment-friendly:** Smooth drop-in via SGLang with an OpenAI-compatible API.
* **Adaptable:** Plays well with light fine-tuning to match domain and style.
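
Since the model is meant to be served behind an OpenAI-compatible endpoint (e.g. via SGLang), requests look like standard chat completions. Below is a minimal sketch that builds such a request body with only the standard library; the local URL, port, and sampling parameters are illustrative assumptions, not fixed values of this model.

```python
import json

# Hypothetical local endpoint: SGLang commonly serves on a local port,
# but the exact URL depends on how you launch the server.
BASE_URL = "http://localhost:30000/v1/chat/completions"

def build_chat_request(prompt: str,
                       model: str = "VibeStudio/MiniMax-M2-THRIFT-55-v1") -> bytes:
    """Build an OpenAI-compatible chat-completions request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,   # illustrative sampling settings
        "max_tokens": 256,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("Write a haiku about pruning.")

# To actually send it (requires a running server):
# import urllib.request
# req = urllib.request.Request(BASE_URL, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Because the request shape is the standard OpenAI one, existing OpenAI client libraries can be pointed at the local base URL without code changes.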

## Intended use

* Local/air-gapped assistants and dev tools
* Cost-sensitive batch and real-time services
* Edge and on-prem deployments that prioritize efficiency

---

## How Our Approach Works

> **Active research in progress** — we continue to iterate and expand ablations.

* **Teacher–student setup:** Start with **MiniMax-M2** as the teacher and a copy as the student.
* **Gradual expert pruning:** Remove **≈5% of experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts.
* **Distill after each prune:** Retrain the student to imitate the teacher on:
  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts**.
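
The staged schedule above can be sketched with a small toy loop. Everything here is synthetic for illustration: the expert count, the random importance scores, and the "protected" set standing in for the Leave-One-Expert-Out veto (the real check measures loss deltas when an expert is ablated, and a full distillation pass runs after every stage).

```python
import random

random.seed(0)

NUM_EXPERTS = 256        # illustrative per-layer expert count, not the real figure
STAGES = 11
PRUNE_PER_STAGE = 0.05   # ~5% of the original pool removed each stage

experts = {i: random.random() for i in range(NUM_EXPERTS)}  # toy importance scores
protected = {i for i in experts if i % 50 == 0}             # toy LOEO survivors

per_stage = round(NUM_EXPERTS * PRUNE_PER_STAGE)            # experts removed per stage

for stage in range(STAGES):
    # Rank surviving experts by importance, least important first.
    candidates = sorted(experts, key=experts.get)
    removed = 0
    for idx in candidates:
        if removed == per_stage:
            break
        # Leave-One-Expert-Out veto: keep rare-but-important experts
        # even when their average importance score is low.
        if idx in protected:
            continue
        del experts[idx]
        removed += 1
    # ...distillation on outputs / hidden states / router would run here...

pruned_frac = 1 - len(experts) / NUM_EXPERTS
print(f"{len(experts)} experts remain ({pruned_frac:.0%} pruned)")
```

With these toy numbers, 13 experts are removed per stage, so 11 stages prune 143 of 256 experts (~56%), matching the ≈5%-per-stage, ≈55%-total arithmetic in the bullets above.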

---

**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
https://github.com/latent-variable/minimax-agent-guide
https://github.com/latent-variable/minimax-agent-guide |