|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: vllm |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- text-generation |
|
|
- conversational |
|
|
- moe |
|
|
- quantized |
|
|
- compressed-tensors |
|
|
- awq |
|
|
- w4a16 |
|
|
- nvfp4 |
|
|
base_model: MiniMaxAI/MiniMax-M2.5 |
|
|
base_model_relation: quantized |
|
|
quantized_by: TheHouseOfTheDude |
|
|
license: other |
|
|
--- |
|
|
|
|
|
# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM) |
|
|
|
|
|
This repository contains **quantized inference builds** of **MiniMaxAI/MiniMax-M2.5** exported in the **compressed-tensors** layout for **vLLM**. |
|
|
|
|
|
MiniMax-M2.5 is a large **Mixture-of-Experts (MoE)** model. The attached quant scripts calibrate **all experts** (not just router top-k) to produce more robust scales across the full mixture. |
|
|
|
|
|
--- |
|
|
|
|
|
## Variants / Branches |
|
|
|
|
|
This repo publishes **two quant variants**: |
|
|
|
|
|
- **AWQ-INT4** — weight-only AWQ (**INT4 weights**, FP16/BF16 activations at runtime) |
|
|
- **NVFP4** — NVFP4 quant (**FP4 weights + FP4 activations**), intended for runtimes that support NVFP4 kernels |
|
|
|
|
|
> The `main` branch is typically a landing page. The runnable artifacts live under the **AWQ-INT4** and **NVFP4** branches. |
|
|
|
|
|
--- |
|
|
|
|
|
## What’s inside (per variant) |
|
|
|
|
|
Each variant branch includes: |
|
|
|
|
|
- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json` |
|
|
- `config.json` with compressed-tensors quant metadata |
|
|
- Tokenizer artifacts (and chat template assets if present) |
|
|
|
|
|
Exports are written with `save_compressed=True` so vLLM can load them as **compressed-tensors**. |
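For reference, the compressed-tensors metadata in `config.json` typically follows a shape like the one below. This fragment is illustrative only (field values are not copied from this repo's actual config):

```json
"quantization_config": {
  "quant_method": "compressed-tensors",
  "format": "pack-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "int",
        "symmetric": true,
        "strategy": "group",
        "group_size": 128
      }
    }
  },
  "ignore": ["lm_head"]
}
```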
|
|
|
|
|
--- |
|
|
|
|
|
## Critical MoE detail: all experts are activated during calibration |
|
|
|
|
|
Calibration is **MoE-aware**: |
|
|
|
|
|
1. Each MoE block is wrapped/replaced during calibration so **ALL experts execute** for calibration forward passes. |
|
|
2. The oneshot quant call is configured to **calibrate all experts** end-to-end. |
|
|
|
|
|
**Why it matters:** If only the router's top-k experts are exercised during calibration, rarely-routed experts receive poor scales and quantize badly, which can destabilize inference whenever those experts are selected.
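The wrapping idea can be illustrated with a minimal, framework-free sketch. All names here are hypothetical (the real scripts wrap the model's torch MoE modules); the point is only the routing switch:

```python
# Sketch: force ALL experts to execute during calibration forward passes,
# instead of only the router's top-k. Toy numbers, no real framework.

class MoEBlock:
    """Toy MoE block: the router normally selects top-k experts."""
    def __init__(self, experts, top_k=2):
        self.experts = experts          # list of callables (expert MLPs)
        self.top_k = top_k
        self.calibrate_all = False      # flipped on during calibration

    def router_scores(self, x):
        # Stand-in for a learned gate; real routers emit per-token logits.
        return [(i % 3) + x for i in range(len(self.experts))]

    def forward(self, x):
        scores = self.router_scores(x)
        if self.calibrate_all:
            chosen = range(len(self.experts))   # every expert sees data
        else:
            chosen = sorted(range(len(self.experts)),
                            key=lambda i: -scores[i])[:self.top_k]
        total = sum(scores[i] for i in chosen)
        # Score-weighted mixture of the selected experts' outputs.
        return sum(scores[i] / total * self.experts[i](x) for i in chosen)

experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
block = MoEBlock(experts, top_k=2)

block.calibrate_all = True       # calibration: all four experts run
y_all = block.forward(1.0)
block.calibrate_all = False      # inference: only the top-2 run
y_topk = block.forward(1.0)
```

With `calibrate_all` enabled, every expert receives calibration statistics, which is exactly what keeps rarely-routed experts from getting degenerate scales.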
|
|
|
|
|
--- |
|
|
|
|
|
## Quantization scope: what is and is not quantized |
|
|
|
|
|
### Shared rule (both variants) |
|
|
|
|
|
The scripts are designed to quantize **only the MoE expert MLP weights**, e.g.: |
|
|
|
|
|
- `block_sparse_moe.experts.*.w1` |
|
|
- `block_sparse_moe.experts.*.w2` |
|
|
- `block_sparse_moe.experts.*.w3` |
|
|
|
|
|
Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.). |
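The include/exclude rule amounts to name-pattern matching over module/parameter names. A small sketch (the expert patterns mirror the list above; the other names are illustrative):

```python
# Sketch: decide which parameter names fall inside the quantization scope.
from fnmatch import fnmatch

TARGETS = [
    "*.block_sparse_moe.experts.*.w1",
    "*.block_sparse_moe.experts.*.w2",
    "*.block_sparse_moe.experts.*.w3",
]

def in_quant_scope(param_name: str) -> bool:
    """True only for MoE expert MLP weights; everything else is excluded."""
    return any(fnmatch(param_name, pat) for pat in TARGETS)

names = [
    "model.layers.0.block_sparse_moe.experts.7.w1",  # expert MLP -> quantized
    "model.layers.0.block_sparse_moe.gate",          # router     -> excluded
    "model.layers.0.self_attn.q_proj",               # attention  -> excluded
    "lm_head",                                       # output head -> excluded
]
flags = [in_quant_scope(n) for n in names]
```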
|
|
|
|
|
--- |
|
|
|
|
|
## AWQ-INT4 (W4A16) details |
|
|
|
|
|
- **Weights:** INT4 (`num_bits=4`, symmetric) |
|
|
- **Activations:** A16 runtime (FP16/BF16) |
|
|
- **Grouping:** group-wise AWQ; group size is configured by the script/CLI |
|
|
- **Targets:** linear layers (restricted to expert MLP linears per scope) |
|
|
- **Ignored:** attention/embeddings/router/norms/`lm_head` (kept higher precision) |
|
|
- **Smoothing:** script sets up scaling maps around post-attn norms and expert MLP weights to improve stability |
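The W4A16 scheme above (symmetric INT4 weights with per-group scales, activations left in 16-bit) can be sketched numerically. Group size and weight values here are illustrative; real kernels operate on packed tensors:

```python
# Sketch of group-wise symmetric INT4 quantization (W4A16-style).

def quantize_group(ws, num_bits=4):
    """Symmetric quantization of one weight group to signed INT4."""
    qmax = 2 ** (num_bits - 1) - 1                    # 7 for 4-bit symmetric
    scale = max(abs(w) for w in ws) / qmax or 1.0     # one scale per group
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in ws]
    return q, scale

def quantize_row(row, group_size=4):
    """Split a weight row into groups, each with its own scale."""
    return [quantize_group(row[g:g + group_size])
            for g in range(0, len(row), group_size)]

row = [0.10, -0.70, 0.30, 0.02, 1.20, -0.05, 0.21, -2.80]
groups = quantize_row(row, group_size=4)
# Dequantize (q * scale) to inspect the reconstruction error:
deq = [q * s for qs, s in groups for q in qs]
```

The per-group scale is why larger outliers in one group do not wash out the precision of the rest of the row.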
|
|
|
|
|
--- |
|
|
|
|
|
## NVFP4 details |
|
|
|
|
|
- **Weights:** FP4 |
|
|
- **Activations:** FP4 |
|
|
- **Targets:** linear layers (restricted to expert MLP linears per scope) |
|
|
- **Ignored:** attention/embeddings/router/norms/`lm_head` |
|
|
- **Runtime:** requires NVFP4-capable kernels (often newer GPU + software stack) |
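FP4 here refers to a 4-bit E2M1 floating-point format, which can represent only a small grid of magnitudes. A sketch of that grid and nearest-value rounding (NVFP4 additionally applies per-block scales, omitted here for clarity; ties are not handled specially, whereas real hardware uses round-to-nearest-even):

```python
# Sketch: the FP4 (E2M1) value grid and nearest-value rounding.
# E2M1 magnitudes: subnormals {0, 0.5}, then 1.m * 2^(e-1) for e in 1..3.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in FP4_MAGNITUDES for s in (1.0, -1.0)})

def to_fp4(x: float) -> float:
    """Round x to the nearest representable FP4 value."""
    return min(FP4_VALUES, key=lambda v: abs(v - x))

snapped = [to_fp4(x) for x in (0.2, 0.9, 2.4, 5.1, 7.3, -1.7)]
```

Because the grid tops out at ±6 before scaling, the per-block scale factor carries most of the dynamic range in NVFP4.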
|
|
|
|
|
--- |
|
|
|
|
|
## Calibration data, sample count, and sequence length |
|
|
|
|
|
Both scripts use a **dataset recipe YAML/config** that controls: |
|
|
|
|
|
- `max_seq_length` |
|
|
- shuffle + seed |
|
|
- optional `num_samples` |
|
|
- dataset sources with formatter/column mapping and per-source sample counts |
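A recipe with those knobs might look like the following. Every dataset id, field name, and count below is a placeholder, not the recipe actually used for these builds:

```yaml
# Illustrative calibration recipe; all values are placeholders.
max_seq_length: 4096
shuffle: true
seed: 42
num_samples: 512                    # optional global cap
datasets:
  - name: example/chat-corpus       # hypothetical dataset id
    formatter: chat                 # maps rows to text via the chat template
    column: messages
    num_samples: 256
  - name: example/code-corpus       # hypothetical dataset id
    formatter: plain
    column: text
    num_samples: 256
```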
|
|
|
|
|
**Tokenization behavior** |
|
|
|
|
|
- `padding=False` |
|
|
- `truncation=True` |
|
|
- `max_length=MAX_SEQUENCE_LENGTH` |
|
|
- `add_special_tokens=False` |
|
|
|
|
|
> The exact dataset names/counts live in your recipe file; this README documents the pipeline and knobs. |
|
|
|
|
|
--- |
|
|
|
|
|
## FP8 compatibility handling (base stored as FP8) |
|
|
|
|
|
If the base ships FP8 parameters, the scripts: |
|
|
|
|
|
- load in BF16, |
|
|
- convert FP8 parameters to BF16 for quantization compatibility, |
|
|
- sanitize quantization-related config fields to avoid serialization/tracing issues. |
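The conversion step reduces to a dtype sweep over the loaded parameters. A pure-Python stand-in (the real scripts cast torch tensors whose dtype is `float8_e4m3fn`/`float8_e5m2` via `tensor.to(torch.bfloat16)`):

```python
# Sketch: upcast FP8-typed parameters to BF16 before quantization.
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

def upcast_fp8_params(state_dict):
    """Return a copy where FP8-typed entries are retyped as bfloat16."""
    out = {}
    for name, (dtype, values) in state_dict.items():
        if dtype in FP8_DTYPES:
            dtype = "bfloat16"     # real code: tensor.to(torch.bfloat16)
        out[name] = (dtype, values)
    return out

sd = {
    "experts.0.w1": ("float8_e4m3fn", [0.5, -1.0]),   # FP8 -> upcast
    "norm.weight":  ("bfloat16", [1.0, 1.0]),         # left as-is
}
casted = upcast_fp8_params(sd)
```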
|
|
|
|
|
--- |
|
|
|
|
|
## Quickstart (vLLM) |
|
|
|
|
|
### AWQ-INT4 branch |
|
|
|
|
|
```bash |
|
|
pip install -U vllm |
|
|
|
|
|
vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
--revision AWQ-INT4 \
|
|
--quantization compressed-tensors \ |
|
|
--tensor-parallel-size 8 \ |
|
|
--enable-expert-parallel \ |
|
|
--dtype bfloat16 |
|
|
``` |
|
|
|
|
|
### NVFP4 branch |
|
|
|
|
|
```bash |
|
|
pip install -U vllm |
|
|
|
|
|
vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
--revision NVFP4 \
|
|
--quantization compressed-tensors \ |
|
|
--tensor-parallel-size 8 \ |
|
|
--enable-expert-parallel |
|
|
``` |
|
|
|
|
|
**Notes** |
|
|
|
|
|
- MiniMax-M2.5 is extremely large; multi-GPU + expert parallel is strongly recommended. |
|
|
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly. |
|
|
- Serving from a local path also works: point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).
|
|
|
|
|
--- |
|
|
|
|
|
## Intended use |
|
|
|
|
|
- High-throughput instruction/chat inference where MoE efficiency matters |
|
|
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory |
|
|
- Long-context workloads (subject to your hardware limits) |
|
|
|
|
|
Quantization changes **weight representation only**. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate. |
|
|
|
|
|
--- |
|
|
|
|
|
## Lineage |
|
|
|
|
|
- **Base model:** https://huggingface.co/MiniMaxAI/MiniMax-M2.5 |
|
|
- **This repo:** quantized inference variants exported to **compressed-tensors** for vLLM: |
|
|
- **AWQ-INT4** |
|
|
- **NVFP4** |
|
|
|
|
|
--- |
|
|
|
|
|
## Changelog |
|
|
|
|
|
- **v1 (current)** — Initial release with two quant variants: |
|
|
- **AWQ-INT4** (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script) |
|
|
- **NVFP4** (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime) |
|
|
|