File size: 3,976 Bytes
2a21ed3
 
 
 
 
 
 
f401b3d
 
 
 
 
 
 
2a21ed3
e1e50c9
2a21ed3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
aca0117
296979a
 
 
 
 
 
 
 
 
 
 
 
 
229447a
296979a
 
 
 
 
 
 
6c85e69
296979a
 
 
 
 
 
 
 
 
 
 
 
2a21ed3
296979a
 
 
 
 
 
 
 
 
 
 
 
2a21ed3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
aca0117
2a21ed3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
aca0117
2a21ed3
 
 
 
 
 
 
 
f401b3d
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
pipeline_tag: image-text-to-text
license: other
license_name: minimax-community
license_link: LICENSE
library_name: transformers
tags:
- multimodal
- moe
- agent
- coding
- video
base_model:
- MiniMaxAI/MiniMax-M3
---

# Model Overview

- **Model Architecture:** MiniMaxM3SparseForConditionalGeneration
  - **Input:** Text, Image
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.1.1
- **PyTorch**: 2.10.0
- **Transformers**: 5.2.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic


# Model Quantization

The model was quantized from [MiniMaxAI/MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights are quantized to MXFP4 and activations are quantized to MXFP4.


**Quantization scripts:**
```python
from quark.torch import LLMTemplate, ModelQuantizer

# --- Register template ---
minimax_m3_vl_template = LLMTemplate(
    model_type="minimax_m3_vl",
    kv_layers_name=["*language_model.*k_proj", "*language_model.*v_proj"],
    q_layer_name="*language_model.*q_proj",
    exclude_layers_name=[
        "*lm_head",
        "*vision_tower*",
        "*multi_modal_projector*",
        "*patch_merge_mlp*",
        "*block_sparse_moe.gate",
        "*self_attn*",
    ],
)
LLMTemplate.register_template(minimax_m3_vl_template)
print(f"[INFO]: Registered template '{minimax_m3_vl_template.model_type}'")

# --- Configuration ---
model_dir = "MiniMaxAI/MiniMax-M3"
output_dir = "amd/MiniMax-M3-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "*lm_head",
    "*vision_tower*",
    "*multi_modal_projector*",
    "*patch_merge_mlp*",
    "*block_sparse_moe.gate",
    "*self_attn*",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
]

# --- Build quant config from template ---
template = LLMTemplate.get("minimax_m3_vl")
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# --- File-to-file quantization (memory-efficient, no full model loading) ---
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_dir,
    save_path=output_dir,
)
print(f"[INFO]: Quantization complete. Output saved to {output_dir}")
```

# Evaluation
The model was evaluated on gsm8k benchmarks using the vllm framework.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>MiniMaxAI/MiniMax-M3 </strong>
   </td>
   <td><strong>amd/MiniMax-M3-MXFP4(this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>gsm8k (flexible-extract) 
   </td>
   <td>95.30
   </td>
   <td>94.19
   </td>
   <td>98.84%
   </td>
  </tr>
</table>


### Reproduction

The GSM8K results were obtained using the lm-eval framework, based on the
Docker image `rocm/pytorch-private:vllm-hy-mm-06112026`. The vLLM shipped in
that image was used as-is, with the patch from this PR ([#45794](https://github.com/vllm-project/vllm/pull/45794/changes)) applied on top.

#### Launching server
```bash
vllm serve /mnt/amd/MiniMax-M3-MXFP4 \
  --trust-remote-code \
  --block-size 128 \
  --tensor-parallel-size 8 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m3 \
  --moe-backend emulation
```


#### Evaluating model in a new terminal
```bash
lm_eval \
  --model local-chat-completions \
  --model_args "model=/mnt/amd/MiniMax-M3-MXFP4,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32,max_gen_toks=16384" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1 \
  --apply_chat_template \
  --fewshot_as_multiturn
```