---
language:
  - en
library_name: vllm
pipeline_tag: text-generation
tags:
  - text-generation
  - conversational
  - moe
  - quantized
  - compressed-tensors
  - awq
  - w4a16
  - nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains quantized inference builds of MiniMaxAI/MiniMax-M2.5 exported in the compressed-tensors layout for vLLM.

MiniMax-M2.5 is a large Mixture-of-Experts (MoE) model. The attached quant scripts calibrate all experts (not just router top-k) to produce more robust scales across the full mixture.


Variants / Branches

This repo publishes two quant variants:

  • AWQ-INT4 — weight-only AWQ (INT4 weights, FP16/BF16 activations at runtime)
  • NVFP4 — NVFP4 quant (FP4 weights + FP4 activations), intended for runtimes that support NVFP4 kernels

The main branch is typically a landing page. The runnable artifacts live under the AWQ-INT4 and NVFP4 branches.


What’s inside (per variant)

Each variant branch includes:

  • Sharded quantized weights (*.safetensors) + model.safetensors.index.json
  • config.json with compressed-tensors quant metadata
  • Tokenizer artifacts (and chat template assets if present)

Exports are written with save_compressed=True so vLLM can load them as compressed-tensors.
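As an illustration of what a loader keys off, a compressed-tensors export records its settings under a quantization_config block in config.json with quant_method set to compressed-tensors. The field values below are examples for illustration, not a dump of this repo's actual config:

```python
import json, os, tempfile

# Illustrative shape only: the real config.json in each branch carries the
# full compressed-tensors metadata (schemes, group sizes, ignore lists).
example_config = {
    "model_type": "minimax",                     # placeholder value
    "quantization_config": {
        "quant_method": "compressed-tensors",    # the key loaders dispatch on
        "format": "pack-quantized",              # example format string
    },
}

path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(example_config, f, indent=2)

# A runtime such as vLLM inspects this key to pick the compressed-tensors
# weight loader instead of the plain safetensors path.
with open(path) as f:
    quant_method = json.load(f)["quantization_config"]["quant_method"]
print(quant_method)
```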


Critical MoE detail: all experts are activated during calibration

Calibration is MoE-aware:

  1. Each MoE block is wrapped/replaced during calibration so ALL experts execute for calibration forward passes.
  2. The oneshot quant call is configured to calibrate all experts end-to-end.

Why it matters: if only the router's top-k experts are exercised during calibration, rarely routed experts can receive poor scales and quantize badly, causing instability whenever those experts are triggered at inference time.
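The gap between router-driven and all-experts calibration can be sketched with a toy example (pure Python; random routing stands in for the learned router):

```python
import random

NUM_EXPERTS, TOP_K, NUM_BATCHES = 8, 2, 100

def route(rng):
    """Toy router: pick top-k experts at random (stand-in for softmax top-k)."""
    return rng.sample(range(NUM_EXPERTS), TOP_K)

# Normal top-k routing: each batch only reaches TOP_K experts, so calibration
# statistics accumulate unevenly and rare experts may see very few samples.
per_expert_samples = [0] * NUM_EXPERTS
rng = random.Random(0)
for _ in range(NUM_BATCHES):
    for e in route(rng):
        per_expert_samples[e] += 1

# All-experts calibration: every expert runs on every forward pass, so each
# expert accumulates statistics from all NUM_BATCHES calibration samples.
all_experts_samples = [NUM_BATCHES] * NUM_EXPERTS

print(min(per_expert_samples), min(all_experts_samples))
```

Under top-k routing the worst-covered expert sees only a fraction of the calibration set (on average TOP_K/NUM_EXPERTS of it), while all-experts calibration guarantees full coverage.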


Quantization scope: what is and is not quantized

Shared rule (both variants)

The scripts are designed to quantize only the MoE expert MLP weights, e.g.:

  • block_sparse_moe.experts.*.w1
  • block_sparse_moe.experts.*.w2
  • block_sparse_moe.experts.*.w3

Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, lm_head, etc.).
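The include/exclude logic above amounts to name matching over module paths. A minimal sketch (the exact paths and patterns depend on the model implementation; these mirror the scope described above):

```python
import re

# Quantize only expert MLP linears; exclude everything else by name.
QUANT_PATTERNS = [r"block_sparse_moe\.experts\.\d+\.w[123]$"]
IGNORE_PATTERNS = [r"embed_tokens", r"self_attn", r"gate", r"norm", r"lm_head"]

def should_quantize(name: str) -> bool:
    # Ignore list wins: router/attention/embeddings/norms/head stay high precision.
    if any(re.search(p, name) for p in IGNORE_PATTERNS):
        return False
    return any(re.search(p, name) for p in QUANT_PATTERNS)

names = [
    "model.layers.0.block_sparse_moe.experts.3.w1",
    "model.layers.0.block_sparse_moe.gate",      # router: excluded
    "model.layers.0.self_attn.q_proj",           # attention: excluded
    "lm_head",                                   # head: excluded
]
print([n for n in names if should_quantize(n)])
```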


AWQ-INT4 (W4A16) details

  • Weights: INT4 (num_bits=4, symmetric)
  • Activations: A16 runtime (FP16/BF16)
  • Grouping: group-wise AWQ; group size is configured by the script/CLI
  • Targets: linear layers (restricted to expert MLP linears per scope)
  • Ignored: attention/embeddings/router/norms/lm_head (kept higher precision)
  • Smoothing: the script configures smoothing scale mappings between the post-attention norms and the expert MLP weights to improve quantization stability

NVFP4 details

  • Weights: FP4
  • Activations: FP4
  • Targets: linear layers (restricted to expert MLP linears per scope)
  • Ignored: attention/embeddings/router/norms/lm_head
  • Runtime: requires NVFP4-capable kernels (often newer GPU + software stack)

Calibration data, sample count, and sequence length

Both scripts use a dataset recipe YAML/config that controls:

  • max_seq_length
  • shuffle + seed
  • optional num_samples
  • dataset sources with formatter/column mapping and per-source sample counts
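How those knobs interact can be sketched in a few lines (key names here are illustrative; the real keys live in the recipe YAML): shuffle + seed make sampling reproducible, per-source counts cap each dataset, and an optional global num_samples caps the merged set.

```python
import random

recipe = {
    "shuffle": True,
    "seed": 42,
    "num_samples": 4,          # optional global cap
    "sources": [
        {"name": "source_a", "rows": ["a0", "a1", "a2"], "num_samples": 2},
        {"name": "source_b", "rows": ["b0", "b1", "b2"], "num_samples": 3},
    ],
}

def build_calibration_set(cfg):
    rng = random.Random(cfg["seed"])           # seed -> reproducible sampling
    pooled = []
    for src in cfg["sources"]:
        rows = list(src["rows"])
        if cfg["shuffle"]:
            rng.shuffle(rows)
        pooled.extend(rows[: src["num_samples"]])     # per-source cap
    if cfg["shuffle"]:
        rng.shuffle(pooled)
    return pooled[: cfg.get("num_samples", len(pooled))]  # global cap

samples = build_calibration_set(recipe)
print(len(samples))
```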

Tokenization behavior

  • padding=False
  • truncation=True
  • max_length=MAX_SEQUENCE_LENGTH
  • add_special_tokens=False

The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs.


FP8 compatibility handling (base stored as FP8)

If the base checkpoint ships FP8 parameters, the scripts:

  • load in BF16,
  • convert FP8 parameters to BF16 for quantization compatibility,
  • sanitize quantization-related config fields to avoid serialization/tracing issues.
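Schematically, the conversion step looks like the sketch below (dtypes modeled as strings so the sketch has no torch dependency; the real scripts operate on tensors, typically checking dtypes such as torch.float8_e4m3fn and upcasting with .to(torch.bfloat16)):

```python
# Any FP8-stored parameter is upcast to BF16 before quantization so the
# calibration/quantization passes never see FP8 inputs.
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

def upcast_fp8(state_dtypes):
    """Map param name -> dtype, rewriting FP8 entries to bfloat16."""
    return {
        name: ("bfloat16" if dtype in FP8_DTYPES else dtype)
        for name, dtype in state_dtypes.items()
    }

state = {
    "model.layers.0.block_sparse_moe.experts.0.w1": "float8_e4m3fn",
    "model.embed_tokens.weight": "bfloat16",
}
print(upcast_fp8(state))
```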

Quickstart (vLLM)

AWQ-INT4 branch

pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16

NVFP4 branch

pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel

Notes

  • MiniMax-M2.5 is extremely large; multi-GPU + expert parallel is strongly recommended.
  • Long context is KV-cache heavy; tune --max-model-len, batch size, and GPU memory utilization accordingly.
  • Serving from a local path also works: point vllm serve at the variant directory (e.g., .../AWQ-INT4 or .../NVFP4).
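For tuning --max-model-len, a back-of-envelope KV-cache estimate helps. The formula below is standard (K and V each store layers × kv_heads × head_dim values per token); all model dimensions are illustrative placeholders, not MiniMax-M2.5's actual configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Factor of 2: both the K and the V cache store
    # layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical config: 60 layers, 8 KV heads, head_dim 128,
# 128k-token context, batch size 1, FP16/BF16 cache (2 bytes/value).
gib = kv_cache_bytes(60, 8, 128, 128 * 1024, 1) / 2**30
print(round(gib, 1))  # 30.0 GiB for this hypothetical config
```

Even modest batch sizes multiply this linearly, which is why long-context serving is KV-cache bound well before it is weight bound on a quantized MoE.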

Intended use

  • High-throughput instruction/chat inference where MoE efficiency matters
  • Large-scale serving stacks that benefit from reduced weight bandwidth and memory
  • Long-context workloads (subject to your hardware limits)

Quantization changes weight representation only. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.


Lineage

  • Base model: MiniMaxAI/MiniMax-M2.5 (base_model_relation: quantized)
  • Quantized by: TheHouseOfTheDude

Changelog

  • v1 (current) — Initial release with two quant variants:
    • AWQ-INT4 (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script)
    • NVFP4 (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime)