---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- moe
- quantized
- compressed-tensors
- awq
- w4a16
- nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---
# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)
This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5** exported in the **compressed-tensors** layout for **vLLM**.
MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The attached quant scripts calibrate **all experts** (not just router top-k) to produce more robust scales across the full mixture.
---
## Variants / Branches
This repo publishes **two quant variants**:
- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels
> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.
---
## What’s inside (per variant)
Each variant branch includes:
- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets if present)
Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**.
---
## Critical MoE detail: all experts are activated during calibration
Calibration is **MoE-aware**:
1. Each MoE block is wrapped/replaced during calibration so **ALL experts execute** for calibration forward passes.
2. The oneshot quant call is configured to **calibrate all experts** end-to-end.
**Why it matters:** If only the router's top-k experts are exercised during calibration, rarely-selected experts receive poor scales and quantize badly, leading to instability when those experts fire at inference time.
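The coverage gap can be illustrated with a toy router (expert counts, weights, and the skewed gate below are illustrative, not taken from the actual scripts):

```python
# With top-k routing, rarely-selected experts may see few or no calibration
# tokens; forcing all experts to run guarantees every expert gets statistics.
import random

NUM_EXPERTS, TOP_K, NUM_TOKENS = 8, 2, 1000
random.seed(0)

def calibration_coverage(force_all_experts: bool) -> list[int]:
    """Count how many calibration tokens each expert processes."""
    counts = [0] * NUM_EXPERTS
    for _ in range(NUM_TOKENS):
        if force_all_experts:
            routed = range(NUM_EXPERTS)  # every expert sees every token
        else:
            # skewed router: expert 0 dominates, mimicking a real learned gate
            weights = [10, 5, 2, 1, 1, 1, 0.1, 0.01]
            routed = set()
            while len(routed) < TOP_K:
                routed.add(random.choices(range(NUM_EXPERTS), weights)[0])
        for e in routed:
            counts[e] += 1
    return counts

topk = calibration_coverage(force_all_experts=False)
full = calibration_coverage(force_all_experts=True)
print("top-k coverage:    ", topk)  # tail experts get very few tokens
print("all-expert coverage:", full)  # every expert sees all 1000 tokens
```

The tail experts in the top-k run collect almost no calibration statistics, which is exactly the failure mode the all-experts wrapping avoids.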
---
## Quantization scope: what is and is not quantized
### Shared rule (both variants)
The scripts are designed to quantize **only the MoE expert MLP weights**, e.g.:
- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`
Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).
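A minimal sketch of that scope rule as a name filter (the regex is an assumption inferred from the patterns above, not copied from the scripts):

```python
# Quantize only MoE expert MLP linears; leave everything else at full precision.
import re

QUANT_PATTERN = re.compile(r"block_sparse_moe\.experts\.\d+\.(w1|w2|w3)$")

def should_quantize(module_name: str) -> bool:
    """True only for expert MLP weights; router, attention, etc. are skipped."""
    return QUANT_PATTERN.search(module_name) is not None

print(should_quantize("model.layers.0.block_sparse_moe.experts.3.w1"))  # True
print(should_quantize("model.layers.0.block_sparse_moe.gate"))          # False
print(should_quantize("lm_head"))                                       # False
```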
---
## AWQ-INT4 (W4A16) details
- **Weights:** INT4 (`num_bits=4`, symmetric)
- **Activations:** 16-bit at runtime (FP16/BF16); activations are not quantized
- **Grouping:** group-wise AWQ; group size is configured by the script/CLI
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept higher precision)
- **Smoothing:** script sets up scaling maps around post-attn norms and expert MLP weights to improve stability
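Group-wise symmetric quantization can be sketched as follows (group size and rounding here are illustrative; real AWQ additionally searches per-channel smoothing scales before this step):

```python
# Minimal sketch of group-wise symmetric INT4 quantize/dequantize.
def quantize_groupwise(weights, group_size=128, num_bits=4):
    """Quantize a flat weight list group-by-group; return dequantized values."""
    qmax = 2 ** (num_bits - 1) - 1  # symmetric INT4: clip to [-8, 7]
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # one scale per group, sized so the largest magnitude maps to qmax
        scale = max(abs(w) for w in group) / qmax or 1.0
        for w in group:
            q = max(-qmax - 1, min(qmax, round(w / scale)))
            out.append(q * scale)  # dequantize to inspect reconstruction error
    return out

w = [0.5, -1.0, 0.25, 0.125] * 32  # one 128-element group
deq = quantize_groupwise(w)
err = max(abs(a - b) for a, b in zip(w, deq))
print(f"max reconstruction error: {err:.4f}")
```

Smaller groups give tighter scales (lower error) at the cost of more scale metadata per tensor, which is the trade-off the group-size knob controls.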
---
## NVFP4 details
- **Weights:** FP4
- **Activations:** FP4
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head`
- **Runtime:** requires NVFP4-capable kernels (often newer GPU + software stack)
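For intuition, FP4 weight values snap to a 16-entry code set. The sketch below uses the standard E2M1 magnitudes; real NVFP4 also applies per-block scale factors, which are omitted here:

```python
# Round a value to the nearest representable FP4 (E2M1) code point.
E2M1_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted(set(E2M1_POS + [-v for v in E2M1_POS]))

def to_fp4(x: float) -> float:
    """Nearest-value rounding into the FP4 code set."""
    return min(FP4_VALUES, key=lambda v: abs(v - x))

print(to_fp4(2.4))   # -> 2.0
print(to_fp4(-5.1))  # -> -6.0
print(to_fp4(0.2))   # -> 0.0
```

Because the grid is so coarse, per-block scaling (and careful calibration, as above) does most of the work of keeping error tolerable.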
---
## Calibration data, sample count, and sequence length
Both scripts use a **dataset recipe YAML/config** that controls:
- `max_seq_length`
- shuffle + seed
- optional `num_samples`
- dataset sources with formatter/column mapping and per-source sample counts
**Tokenization behavior**
- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`
> The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs.
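The effect of those tokenizer flags can be shown with a toy stand-in (the whitespace "tokenizer" below is a placeholder, not the model's real tokenizer):

```python
# Illustrates truncation=True + padding=False: long inputs are cut at
# max_length, short inputs are left short, and no BOS/EOS tokens are added.
MAX_SEQUENCE_LENGTH = 8

def tokenize(text: str) -> list[str]:
    tokens = text.split()               # stand-in for a real tokenizer
    return tokens[:MAX_SEQUENCE_LENGTH]  # truncation=True, max_length=...

short = tokenize("one two three")
long_ = tokenize("a b c d e f g h i j")
print(len(short), len(long_))  # 3 8
```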
---
## FP8 compatibility handling (base stored as FP8)
If the base ships FP8 parameters, the scripts:
- load in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
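The config-sanitizing step amounts to dropping the base model's stale FP8 metadata before re-quantizing (key names below are assumptions; the actual scripts may touch additional fields):

```python
# Remove the base checkpoint's FP8 quantization metadata so the exporter
# writes fresh compressed-tensors metadata instead of inheriting stale fields.
def sanitize_config(config: dict) -> dict:
    """Return a copy of the config without quantization-related keys."""
    cleaned = dict(config)
    for key in ("quantization_config", "quantization"):
        cleaned.pop(key, None)
    return cleaned

base = {"model_type": "minimax", "quantization_config": {"quant_method": "fp8"}}
print(sanitize_config(base))  # {'model_type': 'minimax'}
```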
---
## Quickstart (vLLM)
### AWQ-INT4 branch
```bash
pip install -U vllm
vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```
### NVFP4 branch
```bash
pip install -U vllm
vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```
**Notes**
- MiniMax-M2.5 is extremely large; multi-GPU + expert parallel is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path works too; point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).
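When tuning `--max-model-len`, a back-of-the-envelope KV-cache estimate helps. The hyperparameters below are placeholders for illustration, not MiniMax-M2.5's actual architecture:

```python
# Rough KV-cache sizing: K and V tensors across all layers, KV heads,
# positions, and batch entries.
def kv_cache_gib(num_layers, num_kv_heads, head_dim,
                 seq_len, batch, dtype_bytes=2):
    """KV-cache size in GiB (dtype_bytes=2 for FP16/BF16 caches)."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes
    return total / 2**30

# e.g. 60 layers, 8 KV heads (GQA), head_dim 128, 128k context, batch 1, BF16:
print(f"{kv_cache_gib(60, 8, 128, 131072, 1):.1f} GiB per sequence")
```

Since the cache grows linearly in both context length and batch size, halving `--max-model-len` frees the same memory as halving concurrency.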
---
## Intended use
- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory
- Long-context workloads (subject to your hardware limits)
Quantization changes **weight representation only**. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.
---
## Lineage
- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
- **AWQ-INT4**
- **NVFP4**
---
## Changelog
- **v1 (current)** — Initial release with two quant variants:
- **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script)
- **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime)