---
tags:
- moe
- minimax
- bfloat16
- sglang
- mlx
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---

# VibeStudio/MiniMax-M2-THRIFT-55-v1

**Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned**

A lean, efficiency-first variant of MiniMax-M2 designed to cut **latency and VRAM usage** and raise **throughput** for local, on-prem, and edge deployments.

## TL;DR

* **What:** ~55% expert-pruned MoE, built with staged pruning + knowledge distillation.
* **Why:** Push the efficiency frontier for compact, responsive deployments.
* **Now:** Ready for experimentation, with solid coverage across core evals and more on the way.

---

## Why it’s useful

* **Lower latency:** Fast, responsive interactions for interactive apps and tools.
* **Smaller memory footprint:** Fits tighter VRAM budgets and increases node density.
* **Higher throughput:** Serve more concurrent users on the same hardware.
* **Deployment-friendly:** Smooth drop-in via SGLang with an OpenAI-compatible API.
* **Adaptable:** Plays well with light fine-tuning to match domain and style.
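
Since the model is meant to be served behind an OpenAI-compatible endpoint (e.g. via SGLang), requests look like standard chat completions. Below is a minimal sketch that builds such a request body with only the standard library; the local URL, port, and sampling parameters are illustrative assumptions, not fixed values of this model.

```python
import json

# Hypothetical local endpoint: SGLang commonly serves on a local port,
# but the exact URL depends on how you launch the server.
BASE_URL = "http://localhost:30000/v1/chat/completions"

def build_chat_request(prompt: str,
                       model: str = "VibeStudio/MiniMax-M2-THRIFT-55-v1") -> bytes:
    """Build an OpenAI-compatible chat-completions request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,   # illustrative sampling settings
        "max_tokens": 256,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("Write a haiku about pruning.")

# To actually send it (requires a running server):
# import urllib.request
# req = urllib.request.Request(BASE_URL, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Because the request shape is the standard OpenAI one, existing OpenAI client libraries can be pointed at the local base URL without code changes.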

## Intended use

* Local/air-gapped assistants and dev tools
* Cost-sensitive batch and real-time services
* Edge and on-prem deployments that prioritize efficiency

---

## How Our Approach Works

> **Active research in progress** — we continue to iterate and expand ablations.

* **Teacher–student setup:** Start with **MiniMax-M2** as the teacher and a copy as the student.
* **Gradual expert pruning:** Remove **≈5% of experts per stage** over **~11 stages** (≈**55% total**), guided by importance scores with a lightweight **Leave-One-Expert-Out** check to retain rare-but-important experts.
* **Distill after each prune:** Retrain the student to imitate the teacher on:
  * **Outputs** (token probability distributions),
  * **Hidden states**, and
  * **Router behavior** over the **surviving experts**.
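
The staged schedule above can be sketched with a small toy loop. Everything here is synthetic for illustration: the expert count, the random importance scores, and the "protected" set standing in for the Leave-One-Expert-Out veto (the real check measures loss deltas when an expert is ablated, and a full distillation pass runs after every stage).

```python
import random

random.seed(0)

NUM_EXPERTS = 256        # illustrative per-layer expert count, not the real figure
STAGES = 11
PRUNE_PER_STAGE = 0.05   # ~5% of the original pool removed each stage

experts = {i: random.random() for i in range(NUM_EXPERTS)}  # toy importance scores
protected = {i for i in experts if i % 50 == 0}             # toy LOEO survivors

per_stage = round(NUM_EXPERTS * PRUNE_PER_STAGE)            # experts removed per stage

for stage in range(STAGES):
    # Rank surviving experts by importance, least important first.
    candidates = sorted(experts, key=experts.get)
    removed = 0
    for idx in candidates:
        if removed == per_stage:
            break
        # Leave-One-Expert-Out veto: keep rare-but-important experts
        # even when their average importance score is low.
        if idx in protected:
            continue
        del experts[idx]
        removed += 1
    # ...distillation on outputs / hidden states / router would run here...

pruned_frac = 1 - len(experts) / NUM_EXPERTS
print(f"{len(experts)} experts remain ({pruned_frac:.0%} pruned)")
```

With these toy numbers, 13 experts are removed per stage, so 11 stages prune 143 of 256 experts (~56%), matching the ≈5%-per-stage, ≈55%-total arithmetic in the bullets above.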

---

**Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max)**
https://github.com/latent-variable/minimax-agent-guide
https://github.com/latent-variable/minimax-agent-guide |