---
license: apache-2.0
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- auto-round
- int4
- w4a16
- quantization
- moe
library_name: transformers
---

# MiniMax-M2.7 INT4 AutoRound

4-bit quantized version of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) using [Intel AutoRound](https://github.com/intel/auto-round).

## Quantization Config

| Setting | Value |
|---|---|
| Scheme | W4A16 (INT4 weights, FP16 activations) |
| Group size | 128 |
| Ignored layers | MoE `gate` layers (kept at full precision) |
| Method | RTN (iters=0) |

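As a rough sanity check on what this scheme costs in storage, here is a back-of-the-envelope sketch. It assumes one FP16 scale and one INT4 zero point per 128-weight group; the exact per-group metadata depends on the packed format, so treat the numbers as approximate:

```python
# Back-of-the-envelope storage cost for W4A16 with group size 128.
# Assumptions (not from the packed checkpoint itself): each 128-weight
# group carries one FP16 scale and one INT4 zero point.
BITS_PER_WEIGHT = 4
GROUP_SIZE = 128
BITS_SCALE = 16   # FP16 scale per group (assumption)
BITS_ZERO = 4     # INT4 zero point per group (assumption)

bits_per_weight = BITS_PER_WEIGHT + (BITS_SCALE + BITS_ZERO) / GROUP_SIZE
compression_vs_fp16 = 16 / bits_per_weight

print(f"effective bits per weight: {bits_per_weight}")
print(f"compression vs FP16: {compression_vs_fp16:.2f}x")
```

Under these assumptions the quantized layers land at about 4.16 effective bits per weight, roughly a 3.85× reduction over FP16 (the full-precision `gate` layers add a small amount on top).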
## Usage

### vLLM

```bash
vllm serve Lasimeri/MiniMax-M2.7-int4-AutoRound \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

### SGLang

```bash
python -m sglang.launch_server \
    --model-path Lasimeri/MiniMax-M2.7-int4-AutoRound \
    --trust-remote-code \
    --tp 8 \
    --reasoning-parser minimax-append-think \
    --tool-call-parser minimax-m2
```

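Once either server is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds a minimal chat-completions request using only the standard library; the base URL (vLLM's default port 8000) is an assumption, so adjust it to your deployment:

```python
import json
from urllib import request

# Base URL is an assumption (vLLM's default port); SGLang defaults differ.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "Lasimeri/MiniMax-M2.7-int4-AutoRound",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The official `openai` Python client works the same way if you point its `base_url` at the server.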
## Quantization Hardware

Quantized on a single-node rig:

| Component | Spec |
|---|---|
| CPU | AMD EPYC 7742 (64C / 128T) |
| RAM | 251 GB DDR4 |
| GPUs | 8× RTX 3080 (modded to 20 GB) |

Peak resource usage during quantization: ~25.6 GB system RAM, ~5 GB VRAM on GPU 0, and ~1.3 GB VRAM on each remaining GPU.