---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- moe
- quantized
- compressed-tensors
- awq
- w4a16
- nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5**, exported in the **compressed-tensors** layout for **vLLM**.

MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The attached quant scripts calibrate **all experts** (not just the router's top-k) to produce more robust scales across the full mixture.

---

## Variants / Branches

This repo publishes **two quant variants**:

- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels

> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.

---

## What’s inside (per variant)

Each variant branch includes:

- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets, if present)

Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**.

---

## Critical MoE detail: all experts are activated during calibration

Calibration is **MoE-aware**:

1. Each MoE block is wrapped/replaced during calibration so that **all experts execute** on every calibration forward pass.
2. The oneshot quant call is configured to **calibrate all experts** end-to-end.

**Why it matters:** if only the top-k experts are exercised, rarely routed experts can receive poor scales and quantize badly, leading to instability when those experts are triggered at inference time.
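To make the routing issue concrete, here is a minimal toy sketch (not the MiniMax-M2.5 architecture or the actual calibration wrapper — the class, expert functions, and shapes are all illustrative) of a top-k MoE block with a `calibrate_all` switch. In normal mode only the top-k experts run and the rest see no data; in calibration mode every expert executes, so every expert can contribute activation statistics for scale estimation:

```python
import math

class ToyMoE:
    """Toy top-k MoE block illustrating why calibration should run ALL experts.

    Hypothetical names/shapes for illustration only; each "expert" is just a
    scalar function here, standing in for an expert MLP (w1/w2/w3).
    """

    def __init__(self, num_experts=4, top_k=2):
        self.top_k = top_k
        # Stand-in experts: expert i multiplies its input by (i + 1).
        self.experts = [lambda x, w=i + 1: x * w for i in range(num_experts)]
        self.calls = [0] * num_experts  # how many times each expert saw data

    def forward(self, x, router_logits, calibrate_all=False):
        if calibrate_all:
            # Calibration mode: every expert executes on this forward pass.
            chosen = list(range(len(self.experts)))
        else:
            # Normal routing: only the top-k experts by router logit execute.
            chosen = sorted(range(len(self.experts)),
                            key=lambda i: router_logits[i],
                            reverse=True)[:self.top_k]
        # Softmax over the selected experts' logits gives the mixing weights.
        exps = [math.exp(router_logits[i]) for i in chosen]
        total = sum(exps)
        out = 0.0
        for e, i in zip(exps, chosen):
            self.calls[i] += 1
            out += (e / total) * self.experts[i](x)
        return out
```

With top-k routing, rarely selected experts end calibration with zero observed samples; forcing `calibrate_all=True` during the calibration passes is what guarantees every expert gets usable statistics.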
---

## Quantization scope: what is and is not quantized

### Shared rule (both variants)

The scripts quantize **only the MoE expert MLP weights**, e.g.:

- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`

Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).

---

## AWQ-INT4 (W4A16) details

- **Weights:** INT4 (`num_bits=4`, symmetric)
- **Activations:** 16-bit at runtime (FP16/BF16)
- **Grouping:** group-wise AWQ; group size is configured by the script/CLI
- **Targets:** linear layers (restricted to expert MLP linears per the scope above)
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept at higher precision)
- **Smoothing:** the script sets up scaling maps around post-attention norms and expert MLP weights to improve stability

---

## NVFP4 details

- **Weights:** FP4
- **Activations:** FP4
- **Targets:** linear layers (restricted to expert MLP linears per the scope above)
- **Ignored:** attention/embeddings/router/norms/`lm_head`
- **Runtime:** requires NVFP4-capable kernels (typically a newer GPU plus a matching software stack)

---

## Calibration data, sample count, and sequence length

Both scripts use a **dataset recipe YAML/config** that controls:

- `max_seq_length`
- shuffle + seed
- optional `num_samples`
- dataset sources with formatter/column mapping and per-source sample counts

**Tokenization behavior**

- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`

> The exact dataset names/counts live in your recipe file; this README documents the pipeline and its knobs.

---

## FP8 compatibility handling (base stored as FP8)

If the base model ships FP8 parameters, the scripts:

- load the model in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
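For intuition about the W4A16 scheme described above, here is a minimal pure-Python sketch of group-wise symmetric INT4 weight quantization. It is illustrative only: the actual exports come from the quant scripts' AWQ pipeline, which additionally applies activation-aware per-channel scaling before rounding, and the real group size is whatever the script/CLI configures:

```python
def quantize_group_int4(weights, group_size=128):
    """Symmetric per-group INT4 quantization (illustrative sketch).

    Each group of `group_size` weights shares one scale; values are mapped
    to integers in the symmetric INT4 range [-8, 7].
    """
    qmax = 7  # scales are derived from the positive bound of the range
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        # Round-to-nearest, then clamp into the INT4 range.
        quantized.append([max(-8, min(7, round(w / scale))) for w in group])
    return quantized, scales

def dequantize_group(quantized, scales):
    """Reconstruct approximate FP weights: each int is rescaled by its group's scale."""
    return [q * s for group, s in zip(quantized, scales) for q in group]
```

The group size trades accuracy against overhead: smaller groups mean more scales (more metadata, better fit per group), larger groups amortize the scale storage at some accuracy cost.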
---

## Quickstart (vLLM)

### AWQ-INT4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```

### NVFP4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```

**Notes**

- MiniMax-M2.5 is extremely large; multi-GPU serving with expert parallelism is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path works too — point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).

---

## Intended use

- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory
- Long-context workloads (subject to your hardware limits)

Quantization changes the **weight representation only**. It does not modify the tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.

---

## Lineage

- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
  - **AWQ-INT4**
  - **NVFP4**

---

## Changelog

- **v1 (current)** — Initial release with two quant variants:
  - **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in the script)
  - **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires an NVFP4-capable runtime)