MiniMax-M2.7-Quark-W8A8-INT8

A W8A8 INT8 quantization of MiniMaxAI/MiniMax-M2.7 (456B MoE), produced with AMD Quark.

Model Details

| Property | Value |
| --- | --- |
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MoE (Mixture of Experts), 62 layers, 256 experts, top-8 routing |
| Parameters | 456B total, ~45.9B active |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark (ptpc_int8 scheme) |
| Model Size | 216 GB (47 safetensors shards) |
| Original Size | ~216 GB (FP8 E4M3 blockwise) |

Quantization Scheme

| Component | Dtype | Granularity | Mode |
| --- | --- | --- | --- |
| Weight | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head | BF16 | n/a | unquantized |
| MoE gates | BF16 | n/a | unquantized |

Accuracy

GSM8K 8-shot evaluation (vLLM, temperature=0):

| Model | Quantization | GSM8K 8-shot | Correct/Total |
| --- | --- | --- | --- |
| MiniMax-M2.7 (FP8 original) | FP8 block-wise [128,128] | 92.80% | 1224/1319 |
| MiniMax-M2.7 (this model) | W8A8 INT8 per-channel/per-token | 92.19% | 1216/1319 |
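
The percentages above follow directly from the correct/total counts. The short sketch below recomputes them and adds a rough binomial standard error to put the gap in context. The helper names `accuracy_pct` and `binomial_stderr_pct` are illustrative and not part of any evaluation harness, and the standard error assumes independent questions.

```python
import math

def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total

def binomial_stderr_pct(correct: int, total: int) -> float:
    """Rough standard error of the accuracy, in percentage points.

    Assumes each question is an independent Bernoulli trial.
    """
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)

# Counts taken from the table above.
results = {
    "FP8 original": (1224, 1319),
    "W8A8 INT8 (this model)": (1216, 1319),
}

for name, (correct, total) in results.items():
    acc = accuracy_pct(correct, total)
    err = binomial_stderr_pct(correct, total)
    print(f"{name}: {acc:.2f}% (+/- {err:.2f} points)")

# Gap between the two runs, in percentage points.
gap = accuracy_pct(1224, 1319) - accuracy_pct(1216, 1319)
print(f"gap: {gap:.2f} points ({1224 - 1216} questions)")
```

The two runs differ by 8 questions out of 1319, about 0.61 percentage points, which is smaller than the rough standard error of a single run.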

How to Use

With vLLM (Recommended)

# Start the server
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --model nameistoken/MiniMax-M2.7-Quark-W8A8-INT8 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --max-model-len 4096

# Chat completion
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nameistoken/MiniMax-M2.7-Quark-W8A8-INT8",
  "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
  "max_tokens": 256,
  "temperature": 0.7
}'
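
The same chat completion request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style response shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Assumed to match the server started above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "nameistoken/MiniMax-M2.7-Quark-W8A8-INT8"

def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires the server to be running; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Hello! What is the capital of France?")` returns the reply as a string.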

Hardware Requirements

  • Minimum: 2x GPUs with ≥128 GB VRAM each (e.g., AMD MI355X) or 4x GPUs with more than 54 GB VRAM each (216 GB of weights / 4, before KV cache and activations; e.g., AMD MI300X)
  • Tensor Parallelism: TP=2 (MI355X) or TP=4 (MI300X) for the 216 GB model

Quantization Details

This model was quantized using AMD Quark's ptpc_int8 (Per-Token Per-Channel INT8) scheme:

  • Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static
  • Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time)
  • Excluded layers: lm_head (to preserve output quality) and all MoE gate layers (to preserve routing precision)
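
To make the scheme concrete, here is a minimal NumPy sketch of per-channel weight quantization combined with per-token dynamic activation quantization for a single linear layer. It illustrates the arithmetic only. It is not AMD Quark's implementation or its API, the function names are illustrative, and it assumes a 2-D activation of shape `(tokens, in_features)` and a weight of shape `(out_features, in_features)`.

```python
import numpy as np

def quantize_weight_per_channel(w):
    """Symmetric static INT8: one scale per output channel (row of w)."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)        # (out, 1)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)   # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def quantize_activation_per_token(x):
    """Symmetric dynamic INT8: one scale per token (row of x), computed at runtime."""
    max_abs = np.abs(x).max(axis=1, keepdims=True)        # (tokens, 1)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_linear(x, w_q, w_scale):
    """y ~= x @ w.T using INT8 operands and INT32 accumulation."""
    x_q, x_scale = quantize_activation_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # (tokens, out), int32
    # Dequantize: each output is scaled by its token scale and its channel scale.
    return acc.astype(np.float32) * x_scale * w_scale.T

# Small demo on random data (shapes are arbitrary).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((16, 64)).astype(np.float32)

w_q, w_scale = quantize_weight_per_channel(w)   # done once, offline
y = int8_linear(x, w_q, w_scale)                # activation scales computed per call
rel_err = np.linalg.norm(y - x @ w.T) / np.linalg.norm(x @ w.T)
print(f"relative error vs. FP32 matmul: {rel_err:.4f}")
```

Because both scales factor out of the integer dot product, the dequantization is a single outer-product rescale of the INT32 accumulator, which is why this per-token/per-channel pairing suits INT8 GEMM kernels.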

Citation

If you use this model, please cite the original MiniMax-M2.7 model:

@misc{minimax2025minimaxm27,
    title={MiniMax-M2.7},
    author={MiniMax},
    year={2025},
    url={https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}

License

This model inherits the Modified MIT License from the base model.
