--- pipeline_tag: image-text-to-text license: other license_name: minimax-community license_link: LICENSE library_name: transformers tags: - multimodal - moe - agent - coding - video --- # Model Overview - **Model Architecture:** MiniMaxM3SparseForConditionalGeneration - **Input:** Text, Image - **Output:** Text - **Supported Hardware Microarchitecture:** AMD MI350/MI355 - **ROCm**: 7.1.1 - **PyTorch**: 2.10.0 - **Transformers**: 5.2.0 - **Operating System(s):** Linux - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/) - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) - **Weight quantization:** OCP MXFP4, Static - **Activation quantization:** OCP MXFP4, Dynamic # Model Quantization The model was quantized from [MiniMaxAI/MiniMax-M3](https://huggingface.co/MiniMaxAI/MiniMax-M3) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights are quantized to MXFP4 and activations are quantized to MXFP4. **Quantization scripts:** ``` cd Quark/examples/torch/language_modeling/llm_ptq/ exclude_layers="*lm_head *vision_tower* *multi_modal_projector* *patch_merge_mlp* *block_sparse_moe.gate *self_attn* *mlp.gate_proj *mlp.up_proj *mlp.down_proj" CUDA_VISIBLE_DEVICES=0 python3 quantize_quark.py \ --model_dir MiniMaxAI/MiniMax-M3 \ --quant_scheme mxfp4 \ --exclude_layers $exclude_layers \ --output_dir /mnt/amd/MiniMax-M3-MXFP4 \ --file2file_quantization ``` For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers. # Evaluation The model was evaluated on gsm8k benchmarks using the vllm framework. ### Accuracy
Benchmark MiniMaxAI/MiniMax-M3 amd/MiniMax-M3-MXFP4(this model) Recovery
gsm8k (flexible-extract) 95.30 94.19 98.84%
### Reproduction The GSM8K results were obtained using the lm-eval framework, based on the Docker image `rocm/pytorch-private:vllm-hy-mm-06112026`. The vLLM shipped in that image was used as-is, with the patch from this PR ([#45794](https://github.com/vllm-project/vllm/pull/45794/changes)) applied on top. #### Launching server ``` vllm serve /mnt/amd/MiniMax-M3-MXFP4 \ --trust-remote-code \ --block-size 128 \ --tensor-parallel-size 8 \ --attention-backend TRITON_ATTN \ --mm-encoder-tp-mode data \ --mm-encoder-attn-backend ROCM_AITER_FA \ --tool-call-parser minimax_m3 \ --enable-auto-tool-choice \ --reasoning-parser minimax_m3 \ --moe-backend emulation ``` #### Evaluating model in a new terminal ``` lm_eval \ --model local-chat-completions \ --model_args "model=/mnt/amd/MiniMax-M3-MXFP4,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32,max_gen_toks=16384" \ --tasks gsm8k \ --num_fewshot 5 \ --batch_size 1 \ --apply_chat_template \ --fewshot_as_multiturn ```