---
license: other
license_name: model-license
library_name: transformers
pipeline_tag: text-generation
tags:
- minimax
- mixture-of-experts
- moe
- pruning
- expert-pruning
- fp8
base_model: morriszjm/MiniMax-M2.5-tiny
---

# MiniMax-M2.5-tiny-24e

Training-free expert-pruned variant of [`morriszjm/MiniMax-M2.5-tiny`](https://huggingface.co/morriszjm/MiniMax-M2.5-tiny), produced by the [`minimax_expert_pruning`](https://github.com/-) pipeline.

## What changed

- `num_local_experts`: **32 → 24** (pruning rate: **25.0 %**)
- All non-MoE tensors (attention, layernorms, embeddings, lm_head, MTP heads if any) are **bit-identical** to the source model.
- `gate.weight` and `e_score_correction_bias` per MoE layer are row-sliced to the kept experts; per-expert tensors of dropped experts are absent; kept experts are renumbered contiguously to `0..23`.
- `top_k = num_experts_per_tok` is unchanged (8).

## Method (one-paragraph)

We run a small calibration set (64 prompts spanning Nokia AI4Code, general English Q&A, multilingual, and reasoning) through the unpruned source model and hook every MoE layer's router. Per layer, we accumulate each expert's **selected probability mass**: the post-sigmoid routing weight that the expert receives, summed over all calibration tokens that selected it in their top-8. We keep the top-K by this score per layer (uniform K) and atomically slice the on-disk per-expert tensors. No gradients, no fine-tuning.
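
The scoring and selection steps described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the function names (`selected_mass`, `plan_layer`, `slice_rows`) are hypothetical, the router is simplified to a bare sigmoid over per-token logits, and it assumes no `e_score_correction_bias` and no renormalization of the top-k weights.

```python
import math

def selected_mass(router_logits, num_experts, top_k):
    """Per-expert sum of post-sigmoid weight over tokens that chose it in their top-k.

    Simplification (assumption): selection uses the raw sigmoid score only.
    """
    mass = [0.0] * num_experts
    for logits in router_logits:  # one list of num_experts logits per token
        probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
        chosen = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        for e in chosen:
            mass[e] += probs[e]
    return mass

def plan_layer(mass, keep):
    """Keep the `keep` highest-mass experts of one layer.

    Returns the kept expert ids (ascending), an old-id -> new-id map that
    renumbers them contiguously from 0, and the kept-min vs drop-max gap.
    """
    ranked = sorted(range(len(mass)), key=lambda e: mass[e], reverse=True)
    kept = sorted(ranked[:keep])
    dropped = ranked[keep:]
    gap = min(mass[e] for e in kept) - max(mass[e] for e in dropped)
    remap = {old: new for new, old in enumerate(kept)}
    return kept, remap, gap

def slice_rows(rows, kept):
    """Row-slice a per-expert table (e.g. router gate rows) to the kept experts."""
    return [rows[e] for e in kept]

# Toy layer: 3 tokens, 4 experts, top-2 routing, keep 3 experts.
toy_logits = [
    [2.0, 1.0, -1.0, -3.0],
    [0.5, 2.5, -2.0, 0.0],
    [1.5, -0.5, 1.0, -4.0],
]
toy_mass = selected_mass(toy_logits, num_experts=4, top_k=2)
toy_kept, toy_remap, toy_gap = plan_layer(toy_mass, keep=3)
```

For this model the same shape of computation would run with 32 experts, top-8 routing, and `keep=24` in every MoE layer; the `remap` dictionary corresponds to the contiguous renumbering of kept experts to `0..23`.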

## Layer-level statistics

- Layers covered: **8**
- Tokens per layer (calibration): **1,851**
- Calibration tokens by bucket: `{"ai4code": 1008, "general_en": 416, "reasoning": 257, "multilingual": 170}`
- Median per-layer "kept-min vs drop-max" routing-mass gap: **+0.7197** (positive = clean separation between the kept and dropped experts; close to zero or negative = experts of similar utility, expect more quality risk)

## Intended use

Production-style serving of the source model's domain (Nokia / Merlin AI4Code plus general English) at reduced HBM footprint. Expect graceful quality degradation versus the unpruned source on tasks well-covered by the calibration mix; quality on out-of-distribution domains may drop further.

## Limitations

- Training-free: no fine-tune recovery, no distillation, no merge.
- Uniform K per layer: late layers may tolerate more pruning than early ones, which is not exploited here.
- Calibration mix is small (64 prompts). Domain coverage is biased toward the included buckets.

## Files

`config.json`, `model-NNNNN-of-NNNNN.safetensors` (FP8), `model.safetensors.index.json`, tokenizer, custom `modeling_minimax_m2.py` + `configuration_minimax_m2.py`, and `expert_prune_plan.json` (full record of which experts were kept per layer).

## Loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code=True is needed because the repo ships its own
# modeling_minimax_m2.py / configuration_minimax_m2.py.
tok = AutoTokenizer.from_pretrained("morriszjm/MiniMax-M2.5-tiny-24e", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "morriszjm/MiniMax-M2.5-tiny-24e",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

For vLLM serving, pass `--trust-remote-code` and, on multi-GPU setups, match `--data-parallel-size` to the expert-parallel (EP) topology that K was chosen for.
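
After loading, a quick check that the config reflects the pruned shape can catch a mismatched checkpoint early. The helper below is a hypothetical sketch, not part of this repository: `check_pruned_config` is an illustrative name, and it assumes the custom config class exposes `num_local_experts` and `num_experts_per_tok` as attributes, as Hugging Face configs usually do for keys in `config.json`. A `SimpleNamespace` stands in for `model.config` so the example runs without downloading weights.

```python
from types import SimpleNamespace

def check_pruned_config(config, expected_experts=24, expected_top_k=8):
    """Return a list of human-readable problems; an empty list means the config looks right."""
    problems = []
    n = getattr(config, "num_local_experts", None)
    k = getattr(config, "num_experts_per_tok", None)
    if n != expected_experts:
        problems.append(f"num_local_experts is {n}, expected {expected_experts}")
    if k != expected_top_k:
        problems.append(f"num_experts_per_tok is {k}, expected {expected_top_k}")
    if isinstance(n, int) and isinstance(k, int) and k > n:
        problems.append(f"top_k ({k}) exceeds the number of experts ({n})")
    return problems

# Stand-in for model.config; with a real model, pass model.config instead.
fake_config = SimpleNamespace(num_local_experts=24, num_experts_per_tok=8)
issues = check_pruned_config(fake_config)
```

With a loaded model, `check_pruned_config(model.config)` should return an empty list if the checkpoint is the 24-expert variant.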