---
pipeline_tag: image-text-to-text
license: other
license_name: minimax-community
license_link: LICENSE
library_name: transformers
tags:
- multimodal
- moe
- agent
- coding
- video
base_model:
- MiniMaxAI/MiniMax-M3
---

# Model Overview

- **Model Architecture:** MiniMaxM3SparseForConditionalGeneration
- **Input:** Text, Image
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.1
- **PyTorch:** 2.10.0
- **Transformers:** 5.2.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** OCP MXFP4, Dynamic

# Model Quantization

The model was quantized from [MiniMaxAI/MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations are quantized to MXFP4.

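
For readers unfamiliar with the format, the following is a minimal pure-Python sketch of what MXFP4 quantization does to one block of values, based on the OCP Microscaling (MX) format: 32 elements share one power-of-two scale (E8M0), and each element is stored as FP4 E2M1. It is an illustration only, not AMD-Quark's implementation; the names `quantize_block`, `E2M1_VALUES`, and `E2M1_EMAX` are hypothetical, and tie-breaking in the rounding step is simplified.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign bit is handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)
BLOCK_SIZE = 32    # number of elements sharing one scale in MX formats


def quantize_block(block):
    """Fake-quantize one block; return (shared_exponent, dequantized_values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # floor(log2(max_abs)) via frexp, which avoids floating-point log error.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - E2M1_EMAX
    # An E8M0 scale can only encode powers of two in this exponent range.
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        # Scale down, saturate at the largest E2M1 magnitude, snap to the grid.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest, x) * scale)
    return shared_exp, out


if __name__ == "__main__":
    exp, deq = quantize_block([1.0, 0.3, -0.6, 0.0])
    print(exp, deq)
```

In this sketch, "static" weight quantization would correspond to computing the shared exponents once, offline, while "dynamic" activation quantization would recompute them for each block at inference time.
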
**Quantization scripts:**

```python
from quark.torch import LLMTemplate, ModelQuantizer

# --- Register template ---
minimax_m3_vl_template = LLMTemplate(
    model_type="minimax_m3_vl",
    kv_layers_name=["*language_model.*k_proj", "*language_model.*v_proj"],
    q_layer_name="*language_model.*q_proj",
    exclude_layers_name=[
        "*lm_head",
        "*vision_tower*",
        "*multi_modal_projector*",
        "*patch_merge_mlp*",
        "*block_sparse_moe.gate",
        "*self_attn*",
    ],
)
LLMTemplate.register_template(minimax_m3_vl_template)
print(f"[INFO]: Registered template '{minimax_m3_vl_template.model_type}'")

# --- Configuration ---
model_dir = "MiniMaxAI/MiniMax-M3"
output_dir = "amd/MiniMax-M3-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = [
    "*lm_head",
    "*vision_tower*",
    "*multi_modal_projector*",
    "*patch_merge_mlp*",
    "*block_sparse_moe.gate",
    "*self_attn*",
    "*mlp.gate_proj",
    "*mlp.up_proj",
    "*mlp.down_proj",
]

# --- Build quant config from template ---
template = LLMTemplate.get("minimax_m3_vl")
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# --- File-to-file quantization (memory-efficient, no full model loading) ---
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_dir,
    save_path=output_dir,
)
print(f"[INFO]: Quantization complete. Output saved to {output_dir}")
```

# Evaluation

The model was evaluated on the GSM8K benchmark using the vLLM framework.

### Accuracy

| Benchmark | MiniMaxAI/MiniMax-M3 | amd/MiniMax-M3-MXFP4 (this model) | Recovery |
| --- | --- | --- | --- |
| gsm8k (flexible-extract) | 95.30 | 94.19 | 98.84% |
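
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. A short check under that assumption, using the two scores from the table (the helper name `recovery_pct` is illustrative):

```python
def recovery_pct(quantized_score: float, baseline_score: float) -> float:
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized_score / baseline_score


baseline = 95.30   # MiniMaxAI/MiniMax-M3, gsm8k (flexible-extract)
quantized = 94.19  # amd/MiniMax-M3-MXFP4, gsm8k (flexible-extract)

recovery = recovery_pct(quantized, baseline)
print(f"{recovery:.2f}%")  # prints "98.84%"
```
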