MiniMax-M2.7-Quark-W8A8-INT8
W8A8 INT8 quantized version of MiniMaxAI/MiniMax-M2.7 (456B MoE) using AMD Quark.
Model Details
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MoE (Mixture of Experts), 62 layers, 256 experts, top-8 routing |
| Parameters | 456B total, ~45.9B active |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark (ptpc_int8 scheme) |
| Model Size | 216 GB (47 safetensors shards) |
| Original Size | ~216 GB (FP8 E4M3 blockwise) |
Quantization Scheme
| Component | Dtype | Granularity | Mode |
|---|---|---|---|
| Weight | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head | BF16 | — | unquantized |
| MoE gates | BF16 | — | unquantized |
Accuracy
GSM8K 8-shot evaluation (vLLM, temperature=0):
| Model | Quantization | GSM8K 8-shot | Correct/Total |
|---|---|---|---|
| MiniMax-M2.7 (FP8 original) | FP8 block-wise [128,128] | 92.80% | 1224/1319 |
| MiniMax-M2.7 (this model) | W8A8 INT8 per-channel/per-token | 92.19% | 1216/1319 |
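The percentages in the table follow directly from the correct/total counts. The short sketch below recomputes them and the gap between the two models; the variable and function names are illustrative only and are not part of any evaluation harness.

```python
# Counts copied from the accuracy table above.
TOTAL = 1319
correct = {"FP8 original": 1224, "W8A8 INT8": 1216}


def accuracy_pct(n_correct, n_total):
    """Return accuracy as a percentage."""
    return 100.0 * n_correct / n_total


fp8_acc = accuracy_pct(correct["FP8 original"], TOTAL)
int8_acc = accuracy_pct(correct["W8A8 INT8"], TOTAL)
drop_points = fp8_acc - int8_acc  # difference in percentage points

for name, n in correct.items():
    print(f"{name}: {accuracy_pct(n, TOTAL):.2f}% ({n}/{TOTAL})")
print(f"Drop: {drop_points:.2f} points ({correct['FP8 original'] - correct['W8A8 INT8']} questions)")
```

On these counts the INT8 model answers 8 fewer questions correctly, a drop of about 0.61 percentage points.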
How to Use
With vLLM (Recommended)
# Start the server
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
--model nameistoken/MiniMax-M2.7-Quark-W8A8-INT8 \
--tensor-parallel-size 4 \
--trust-remote-code \
--max-model-len 4096
# Chat completion
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nameistoken/MiniMax-M2.7-Quark-W8A8-INT8",
"messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
"max_tokens": 256,
"temperature": 0.7
}'
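The same chat completion request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`; the helper names `build_request` and `chat` are hypothetical, and the response layout assumed here is the usual OpenAI-style `choices[0].message.content`.

```python
import json
import urllib.request

# Assumed to match the server command above (default vLLM port).
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nameistoken/MiniMax-M2.7-Quark-W8A8-INT8"


def build_request(prompt, max_tokens=256, temperature=0.7):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello! What is the capital of France?"))` prints the model's reply.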
Hardware Requirements
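- Minimum: 2x GPUs with ≥128 GB VRAM each (e.g., AMD MI355X), or 4x GPUs with more than 54 GB VRAM each (e.g., AMD MI300X). Split evenly across 4 GPUs, the 216 GB of weights take 54 GB per GPU before KV cache and activations.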
- Tensor Parallelism: TP=2 (MI355X) or TP=4 (MI300X) for 216 GB model
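As a rough sanity check on these configurations, the per-GPU share of the weights can be estimated by dividing the checkpoint size by the tensor-parallel degree. The sketch below assumes the weights shard evenly across GPUs, which is a simplification, and it ignores KV cache, activations, and runtime overhead, so the results are lower bounds only. The function name is illustrative.

```python
MODEL_SIZE_GB = 216  # checkpoint size from the model details table


def weights_per_gpu_gb(model_size_gb, tp_size):
    """Lower bound on per-GPU weight memory, assuming an even split."""
    return model_size_gb / tp_size


for tp in (2, 4):
    share = weights_per_gpu_gb(MODEL_SIZE_GB, tp)
    print(f"TP={tp}: at least {share:.0f} GB of weights per GPU")
```

Each GPU needs VRAM comfortably above this figure, since `--max-model-len` and batch size determine how much extra memory the KV cache consumes.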
Quantization Details
This model was quantized using AMD Quark's ptpc_int8 (Per-Token Per-Channel INT8) scheme:
- Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static
- Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time)
- Excluded layers: lm_head (to preserve output quality) and all MoE gate layers (to preserve routing precision)
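The scheme described above can be illustrated with a small pure-Python sketch. It is not AMD Quark's implementation or the vLLM kernel; it only shows the arithmetic for a single linear layer, assuming weights laid out as `[out_features][in_features]` and activations as `[tokens][in_features]`. All function names are hypothetical.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    Returns (int_rows, scales) such that value ~= int_value * scale.
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def float_linear(x, w):
    """Reference y = x @ w.T in floating point."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]


def int8_linear(x, w):
    """y = x @ w.T with INT8 operands and integer accumulation."""
    # Weights: one scale per output channel, computed once (static).
    q_w, s_w = quantize_rows(w)
    # Activations: one scale per token, computed on every call (dynamic).
    q_x, s_x = quantize_rows(x)
    out = []
    for t, x_row in enumerate(q_x):
        out.append([
            # Integer dot product, then rescale by token scale * channel scale.
            sum(a * b for a, b in zip(x_row, w_row)) * s_x[t] * s_w[c]
            for c, w_row in enumerate(q_w)
        ])
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]   # 2 tokens, 3 input features
w = [[0.2, -0.4, 0.1], [1.0, 0.5, -0.5]]    # 2 output channels

y_ref = float_linear(x, w)
y_q = int8_linear(x, w)
max_err = max(abs(a - b) for r, q in zip(y_ref, y_q) for a, b in zip(r, q))
print("max abs error:", max_err)
```

Because each token and each output channel has its own scale, the integer dot product can be rescaled with a single multiply per output element, which is what makes this layout practical for INT8 matrix kernels.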
Citation
If you use this model, please cite the original MiniMax-M2.7 model:
@misc{minimax2025minimaxm27,
title={MiniMax-M2.7},
author={MiniMax},
year={2025},
url={https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}
License
This model inherits the Modified MIT License from the base model.