---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- moe
- quantized
- compressed-tensors
- awq
- w4a16
- nvfp4
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---
# MiniMax-M2.5 — Quantized (compressed-tensors for vLLM)
This repository contains quantized inference builds of MiniMaxAI/MiniMax-M2.5 exported in the compressed-tensors layout for vLLM.
MiniMax-M2.5 is a large Mixture-of-Experts (MoE) model. The attached quant scripts calibrate all experts (not just router top-k) to produce more robust scales across the full mixture.
## Variants / Branches
This repo publishes two quant variants:
- AWQ-INT4 — weight-only AWQ (INT4 weights, FP16/BF16 activations at runtime)
- NVFP4 — NVFP4 quant (FP4 weights + FP4 activations), intended for runtimes that support NVFP4 kernels
The `main` branch is typically a landing page; the runnable artifacts live under the `AWQ-INT4` and `NVFP4` branches.
## What’s inside (per variant)
Each variant branch includes:
- Sharded quantized weights (`*.safetensors`) + `model.safetensors.index.json`
- `config.json` with compressed-tensors quant metadata
- Tokenizer artifacts (and chat template assets if present)

Exports are written with `save_compressed=True` so vLLM can load them as compressed-tensors.
## Critical MoE detail: all experts are activated during calibration
Calibration is MoE-aware:
- Each MoE block is wrapped/replaced during calibration so ALL experts execute for calibration forward passes.
- The oneshot quant call is configured to calibrate all experts end-to-end.
Why it matters: if only the top-k experts are exercised, rarely-routed experts receive poor scales and quantize badly, leading to instability when those experts trigger at inference time.
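The effect of the calibration wrapper can be sketched in a few lines. This is an illustrative numpy toy, not the actual calibration code; the dimensions, expert count, and `moe_forward` helper are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Toy expert "MLPs": one weight matrix each, for illustration only.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def moe_forward(x, router_logits, calibrate_all_experts=False):
    """Route one token through the mixture.

    At inference only the router's top-k experts execute; the calibration
    wrapper forces *every* expert to run so each one sees real activation
    statistics and gets usable quantization scales.
    """
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    if calibrate_all_experts:
        active = list(range(n_experts))                    # all experts execute
    else:
        active = list(np.argsort(router_logits)[-top_k:])  # router top-k only
    out = np.zeros(d)
    for i in active:
        out += probs[i] * (experts[i] @ x)
    return out, active

x = rng.standard_normal(d)
logits = rng.standard_normal(n_experts)
_, seen_topk = moe_forward(x, logits)
_, seen_all = moe_forward(x, logits, calibrate_all_experts=True)
print(len(seen_topk), len(seen_all))  # prints: 2 4
```

With top-k routing, half the experts above never execute and would never accumulate activation statistics; the calibration path runs all of them.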
## Quantization scope: what is and is not quantized

### Shared rule (both variants)

The scripts are designed to quantize only the MoE expert MLP weights, e.g.:

- `block_sparse_moe.experts.*.w1`
- `block_sparse_moe.experts.*.w2`
- `block_sparse_moe.experts.*.w3`

Everything else is excluded for stability (embeddings, attention, router/gate, norms, rotary, `lm_head`, etc.).
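The scope rule amounts to simple name matching over module/parameter names. A sketch with hypothetical parameter names and an illustrative pattern (not the scripts' actual configuration):

```python
from fnmatch import fnmatchcase

# Hypothetical parameter names from one decoder layer (illustrative).
param_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.block_sparse_moe.gate.weight",
    "model.layers.0.block_sparse_moe.experts.3.w1.weight",
    "model.layers.0.block_sparse_moe.experts.3.w2.weight",
    "model.layers.0.input_layernorm.weight",
    "lm_head.weight",
]

# Only the expert MLP linears (w1/w2/w3) are quantized; all else is ignored.
TARGET = "*block_sparse_moe.experts.*.w[123].weight"

quantized = [n for n in param_names if fnmatchcase(n, TARGET)]
ignored = [n for n in param_names if not fnmatchcase(n, TARGET)]
print(len(quantized), len(ignored))  # prints: 2 4
```

Attention, router/gate, norms, and `lm_head` all fall through to the ignore list and stay at higher precision.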
### AWQ-INT4 (W4A16) details

- Weights: INT4 (`num_bits=4`, symmetric)
- Activations: A16 at runtime (FP16/BF16)
- Grouping: group-wise AWQ; group size is configured by the script/CLI
- Targets: linear layers (restricted to expert MLP linears per the scope above)
- Ignored: attention/embeddings/router/norms/`lm_head` (kept at higher precision)
- Smoothing: the script sets up scaling maps around post-attention norms and expert MLP weights to improve stability
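A minimal numpy sketch of what group-wise symmetric INT4 storage implies, assuming one float scale per group and a signed 4-bit range. Real AWQ additionally searches for per-channel activation-aware scaling before rounding; that search is omitted here:

```python
import numpy as np

def quantize_w4a16(w, group_size=128):
    """Group-wise symmetric INT4 quantization of a weight matrix (sketch).

    Each row is split into groups of `group_size` columns; each group gets
    one float scale s = max|w| / 7, and weights are rounded into the signed
    4-bit range. Activations stay FP16/BF16 at runtime (the "A16" part).
    """
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(-1)

def dequantize_w4a16(q, scales, group_size=128):
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (g * scales[..., None]).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_w4a16(w)
err = np.abs(dequantize_w4a16(q, s) - w).max()  # bounded by half a scale step
```

Smaller group sizes give finer scales (lower error) at the cost of more scale storage; that is the trade-off the script's group-size knob controls.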
### NVFP4 details

- Weights: FP4
- Activations: FP4
- Targets: linear layers (restricted to expert MLP linears per the scope above)
- Ignored: attention/embeddings/router/norms/`lm_head`
- Runtime: requires NVFP4-capable kernels (typically newer GPUs and a recent software stack)
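For intuition, FP4 (E2M1) can represent only a handful of magnitudes per scale. A toy sketch of rounding to the E2M1 grid; real NVFP4 uses FP8 (E4M3) per-block scales plus a tensor-level scale and hardware kernels, which a plain float scale stands in for here:

```python
import numpy as np

# FP4 (E2M1) non-negative representable magnitudes.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def to_fp4(x, block=16):
    """Round a vector to FP4 with one shared scale per block (toy sketch)."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1[-1]
    scale[scale == 0] = 1.0
    grid = np.concatenate([-E2M1[::-1], E2M1])     # signed code points
    idx = np.abs((x / scale)[..., None] - grid).argmin(axis=-1)
    return grid[idx] * scale                       # dequantized values

v = [0.1, -0.4, 2.5, 6.0] * 4                      # 16 values -> one block
out = to_fp4(v)
```

Because activations are also FP4, both operands of the expert matmuls fit the narrow format, which is why NVFP4-specific kernels are required at runtime.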
## Calibration data, sample count, and sequence length

Both scripts use a dataset recipe (YAML/config) that controls:

- `max_seq_length`
- shuffle + seed
- optional `num_samples`
- dataset sources, with formatter/column mapping and per-source sample counts

Tokenization behavior:

- `padding=False`
- `truncation=True`
- `max_length=MAX_SEQUENCE_LENGTH`
- `add_special_tokens=False`

The exact dataset names/counts live in your recipe file; this README documents the pipeline and its knobs.
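A recipe matching the knobs above might look like the following. Every name, count, and key here is a placeholder; the real schema is defined by the quant scripts:

```yaml
# Hypothetical recipe: all values below are placeholders.
max_seq_length: 4096
shuffle: true
seed: 42
num_samples: 512                         # optional global cap
datasets:
  - name: example-org/chat-calibration   # placeholder dataset id
    column: text
    formatter: chat_template
    num_samples: 256
```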
## FP8 compatibility handling (base stored as FP8)

If the base checkpoint ships FP8 parameters, the scripts:

- load the model in BF16,
- convert FP8 parameters to BF16 for quantization compatibility,
- sanitize quantization-related config fields to avoid serialization/tracing issues.
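The config-sanitization step can be sketched as plain dict filtering. The key names are illustrative; the actual scripts may sanitize additional fields:

```python
# Keys are illustrative; the actual scripts may strip more fields.
STALE_KEYS = ("quantization_config", "quantization", "compression_config")

def sanitize_config(cfg: dict) -> dict:
    """Drop stale (FP8-era) quantization metadata so the fresh
    compressed-tensors metadata can be serialized without conflicts."""
    return {k: v for k, v in cfg.items() if k not in STALE_KEYS}

cfg = {"model_type": "minimax", "quantization_config": {"fmt": "fp8"}}
clean = sanitize_config(cfg)
print(sorted(clean))  # prints: ['model_type']
```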
## Quickstart (vLLM)

### AWQ-INT4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision AWQ-INT4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --dtype bfloat16
```
### NVFP4 branch

```bash
pip install -U vllm

vllm serve TheHouseOfTheDude/MiniMax-M2.5 \
  --revision NVFP4 \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel
```
## Notes

- MiniMax-M2.5 is extremely large; multi-GPU + expert parallelism is strongly recommended.
- Long context is KV-cache heavy; tune `--max-model-len`, batch size, and GPU memory utilization accordingly.
- Serving from a local path works too: point `vllm serve` at the variant directory (e.g., `.../AWQ-INT4` or `.../NVFP4`).
## Intended use
- High-throughput instruction/chat inference where MoE efficiency matters
- Large-scale serving stacks that benefit from reduced weight bandwidth and memory
- Long-context workloads (subject to your hardware limits)
Quantization changes weight representation only. It does not modify tokenizer, chat template, or safety behavior. Apply your own safety policies/filters as appropriate.
## Lineage
- Base model: https://huggingface.co/MiniMaxAI/MiniMax-M2.5
- This repo: quantized inference variants exported to compressed-tensors for vLLM:
  - AWQ-INT4
  - NVFP4
## Changelog

- v1 (current) — Initial release with two quant variants:
  - AWQ-INT4 (expert-only W4A16 AWQ; all-experts calibration; group size configurable in script)
  - NVFP4 (FP4 weights + FP4 activations; expert-only scope; all-experts calibration; requires NVFP4-capable runtime)