---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
  - text-generation
  - conversational
  - moe
  - quantized
  - compressed-tensors
  - awq
  - w4a16
  - nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)

This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5** exported in the **compressed-tensors** layout for **vLLM**.

MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The accompanying quantization scripts calibrate **all experts** (not just the router's top-k) to produce more robust scales across the full mixture.

---

## Variants / Branches

This repo publishes **two quant variants**:

- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime)
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels

> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches.

---

## What’s inside (per variant)

Each variant branch includes:

- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets if present)

Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**.

---

## Critical MoE detail: all experts are activated during calibration

Calibration is **MoE-aware**:

1. Each MoE block is wrapped/replaced during calibration so **ALL experts execute** for calibration forward passes.
2. The oneshot quant call is configured to **calibrate all experts** end-to-end.

**Why it matters:** If only top-k experts are exercised, rare experts can receive poor scales and quantize badly—leading to instability when those experts trigger at inference time.
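
The failure mode above can be illustrated with a toy sketch (pure Python, all names hypothetical): a degenerate router that collapses onto a few experts leaves the rest with zero calibration statistics unless calibration forces all experts to run.

```python
# Toy illustration (not the repo's actual script): top-k-only calibration vs.
# all-experts calibration in a tiny MoE layer.

class ToyMoE:
    """Tracks how many calibration samples each expert sees."""

    def __init__(self, num_experts=8, top_k=2):
        self.num_experts = num_experts
        self.top_k = top_k
        self.calib_counts = [0] * num_experts

    def route(self, token_id):
        # Degenerate router for illustration: it always prefers the
        # lowest-indexed experts, so experts top_k..N-1 are never chosen.
        return list(range(self.top_k))

    def calibrate(self, token_id, all_experts=False):
        chosen = range(self.num_experts) if all_experts else self.route(token_id)
        for e in chosen:
            self.calib_counts[e] += 1  # a real pass would record activation stats

topk_only, all_experts = ToyMoE(), ToyMoE()
for tok in range(100):
    topk_only.calibrate(tok, all_experts=False)
    all_experts.calibrate(tok, all_experts=True)

# Rarely-routed experts get zero statistics under top-k-only calibration,
# so any quant scales derived for them would be meaningless.
print(min(topk_only.calib_counts), min(all_experts.calib_counts))  # 0 100
```

With all-experts calibration, every expert accumulates the full sample count, so every expert's scales are grounded in real activations.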

---

## Quantization scope: what is and is not quantized

### Shared rule (both variants)

The scripts are designed to quantize **only the MoE expert MLP weights**, e.g.:

- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`

Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).
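
As a sketch of how such a scope can be enforced, the expert MLP module names listed above can be matched by pattern while everything else is skipped (the regex below is illustrative, not necessarily the exact filter the scripts use):

```python
# Illustrative scope filter: quantize only MoE expert MLP linears
# (block_sparse_moe.experts.*.w1 / w2 / w3), exclude everything else.
import re

EXPERT_MLP_RE = re.compile(r"block_sparse_moe\.experts\.\d+\.w[123]$")

def should_quantize(module_name: str) -> bool:
    """True only for MoE expert MLP weight modules."""
    return bool(EXPERT_MLP_RE.search(module_name))

names = [
    "model.layers.0.block_sparse_moe.experts.3.w1",
    "model.layers.0.block_sparse_moe.gate",          # router: excluded
    "model.layers.0.self_attn.q_proj",               # attention: excluded
    "lm_head",                                       # output head: excluded
]
print([n for n in names if should_quantize(n)])
```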

---

## AWQ-INT4 (W4A16) details

- **Weights:** INT4 (`num_bits=4`, symmetric)
- **Activations:** A16 runtime (FP16/BF16)
- **Grouping:** group-wise AWQ; group size is configured by the script/CLI
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept higher precision)
- **Smoothing:** script sets up scaling maps around post-attn norms and expert MLP weights to improve stability
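
The group-wise W4A16 scheme above can be sketched in a few lines of pure Python (a toy round-trip, not the actual packed kernel layout; the group size here is arbitrary):

```python
# Minimal sketch of group-wise symmetric INT4 weight quantization:
# one scale per group, integers in the symmetric range, weights round-trip
# with error bounded by half a quantization step.

def quantize_int4_groupwise(weights, group_size=128):
    """Quantize a flat list of weights to symmetric INT4, one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group) or 1.0
        scale = amax / 7.0                      # symmetric: use +/-7 of [-8, 7]
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize(q, scales, group_size=128):
    return [v * scales[i // group_size] for i, v in enumerate(q)]

w = [0.05 * ((-1) ** i) * (i % 13) for i in range(256)]
qw, s = quantize_int4_groupwise(w, group_size=128)
wd = dequantize(qw, s, group_size=128)
max_err = max(abs(a - b) for a, b in zip(w, wd))
print(max_err <= max(s) / 2 + 1e-9)  # True: error <= half a quant step
```

At runtime the INT4 weights are dequantized (per group) back to FP16/BF16 before the matmul, which is why activations stay at A16 precision.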

---

## NVFP4 details

- **Weights:** FP4
- **Activations:** FP4
- **Targets:** linear layers (restricted to expert MLP linears per scope)
- **Ignored:** attention/embeddings/router/norms/`lm_head`
- **Runtime:** requires NVFP4-capable kernels (often newer GPU + software stack)
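
To make the FP4 format concrete: E2M1 has only 16 code points, so quantization amounts to scaling a value into range and snapping it to the nearest representable number. The toy below uses one scale for a whole vector; real NVFP4 additionally applies small per-block scale factors, which this sketch omits.

```python
# Illustrative FP4 (E2M1) round-to-nearest. The representable magnitudes are
# {0, 0.5, 1, 1.5, 2, 3, 4, 6}; with a sign bit that is 16 code points total.

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in FP4_MAGNITUDES for s in (-1.0, 1.0)})

def quantize_fp4(values):
    """Scale a vector so its max maps to +/-6, then snap to the FP4 grid."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0
    out = []
    for v in values:
        x = v / scale
        out.append(min(FP4_VALUES, key=lambda g: abs(g - x)) * scale)
    return out

vals = [0.9, -0.31, 0.07, -1.0]
print(quantize_fp4(vals))
```

Because both weights and activations are reduced to this 16-point grid, NVFP4 needs hardware kernels that operate on FP4 directly; without them there is no speed or memory benefit.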

---

## Calibration data, sample count, and sequence length

Both scripts use a **dataset recipe YAML/config** that controls:

- `max_seq_length`
- shuffle + seed
- optional `num_samples`
- dataset sources with formatter/column mapping and per-source sample counts

**Tokenization behavior**

- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`

> The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs.

---

## FP8 compatibility handling (base stored as FP8)

If the base ships FP8 parameters, the scripts:

- load in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
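
Schematically, that preprocessing looks like the sketch below. This is a pure-Python stand-in (dtypes represented as strings, config as a plain dict); the real scripts perform the equivalent operations on torch tensors and the model's `config.json`.

```python
# Schematic FP8-base preparation: upcast FP8 parameters to BF16 and strip the
# base model's own quantization metadata so the exporter doesn't serialize a
# conflicting scheme. Names and structure here are illustrative.

def prepare_fp8_base(param_dtypes: dict, config: dict):
    """Return (dtypes, config) ready for re-quantization."""
    upcast = {
        name: ("bfloat16" if dt.startswith("float8") else dt)
        for name, dt in param_dtypes.items()
    }
    # Drop stale quantization metadata inherited from the FP8 checkpoint.
    clean = {k: v for k, v in config.items() if k != "quantization_config"}
    return upcast, clean

dtypes = {"experts.0.w1": "float8_e4m3fn", "norm.weight": "bfloat16"}
cfg = {"hidden_size": 6144, "quantization_config": {"fmt": "fp8"}}
print(prepare_fp8_base(dtypes, cfg))
```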

---

## Quickstart (vLLM)

### AWQ-INT4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```

### NVFP4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5:NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```

**Notes**

- MiniMax-M2.5 is extremely large; multi-GPU + expert parallel is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path works too—point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).

---

## Intended use

- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory
- Long-context workloads (subject to your hardware limits)

Quantization changes **weight representation only**. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.

---

## Lineage

- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM:
  - **AWQ-INT4**
  - **NVFP4**

---

## Changelog

- **v1 (current)** — Initial release with two quant variants:
  - **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script)
  - **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime)