Model shape mismatch
Hi there,
We are hitting the issue below when running the model on MI300X with the suggested vLLM version. It reports that the parameter's shape and the loaded weight's shape do not match:
assert param_data.shape == loaded_weight.shape
docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --shm-size 16G \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  --cap-add=SYS_PTRACE \
  --env VLLM_ROCM_USE_AITER=1 \
  --env VLLM_DISABLE_COMPILE_CACHE=1 \
  -p 8000:8000 \
  -d \
  rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 \
  bash -c "
    python3 -m vllm.entrypoints.openai.api_server \
      --model amd/MiniMax-M2.1-MXFP4 \
      --gpu-memory-utilization 0.95 \
      --max-model-len 196608 \
      --kv-cache-dtype fp8 \
      --enable-chunked-prefill false \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think \
      --quantization quark \
      --trust_remote_code \
      --enable-auto-tool-choice \
      --host 0.0.0.0 \
      --port 8000"
Any chance you could suggest how to fix it?
Hi, this is a model support issue in vLLM for MiniMax-M2.
Could you please apply a patch that adds the packed_modules_mapping to MiniMaxM2Model?
Take this as a reference: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/minimax_vl_01.py#L182
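For context, here is a rough illustration of why the loader's assert fires without that mapping (the dimensions below are made up for illustration; the real ones come from the model config). vLLM fuses q/k/v into a single packed qkv_proj parameter, so without a packed_modules_mapping the loader compares the fused parameter's shape against a single unfused checkpoint tensor, and the shapes differ:

```python
# Hypothetical shapes for a GQA attention layer (illustrative values only).
hidden = 4096
num_q_heads, num_kv_heads, head_dim = 32, 8, 128

# Shape of the fused qkv_proj parameter created by the model.
param_shape = ((num_q_heads + 2 * num_kv_heads) * head_dim, hidden)

# Shape of a standalone q_proj tensor as stored in the checkpoint.
loaded_shape = (num_q_heads * head_dim, hidden)

# Without the mapping, the loader tries to copy the unfused tensor into
# the fused parameter whole, and this mismatch trips its shape assert.
assert param_shape != loaded_shape
```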
Sorry for the late update. The patch you recommended does not work (not sure if I did something wrong).
Patched code
@support_torch_compile
class MiniMaxM2Model(nn.Module):
+ packed_modules_mapping = {
+ "qkv_proj": ["q_proj", "k_proj", "v_proj"],
+ "gate_up_proj": ["gate_proj", "up_proj"],
+ }
Tested docker version
rocm/vllm:rocm7.0.0_vllm0.11.2_20251210
rocm/vllm:v0.14.0_amd_dev
Would it be possible for you to patch this so the model can run on the MI3xx series? I think it would be a huge boost for others onboarding onto AMD MI3xx GPUs.
Can you please try adding the patched code to class MiniMaxM2ForCausalLM(nn.Module, SupportsLoRA, SupportsPP): instead of class MiniMaxM2Model(nn.Module):?
You may also try adding the following in the MiniMaxM2Model class's __init__ function:
vllm_config.quant_config.packed_modules_mapping.update({
"qkv_proj": ["q_proj", "k_proj", "v_proj"],
"gate_up_proj": ["gate_proj", "up_proj"],
})
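For clarity, here is a standalone mock of that second approach. The QuantConfig and VllmConfig stubs below are simplified stand-ins, not the real vLLM classes; they only show the effect of updating the quant config's mapping in __init__ before weights are loaded:

```python
# Simplified stubs standing in for vLLM's real config objects.
class QuantConfig:
    def __init__(self):
        self.packed_modules_mapping = {}

class VllmConfig:
    def __init__(self):
        self.quant_config = QuantConfig()

class MiniMaxM2Model:
    def __init__(self, vllm_config):
        # The suggested patch: register which checkpoint tensors fold into
        # each fused vLLM parameter, so q/k/v and gate/up shards are
        # stacked into qkv_proj / gate_up_proj during weight loading.
        vllm_config.quant_config.packed_modules_mapping.update({
            "qkv_proj": ["q_proj", "k_proj", "v_proj"],
            "gate_up_proj": ["gate_proj", "up_proj"],
        })

cfg = VllmConfig()
MiniMaxM2Model(cfg)
```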
We are able to get past the shape mismatch but hit another issue: Quark seems to access an invalid memory address and causes a GPU hang.
(EngineCore_DP0 pid=5992)
(EngineCore_DP0 pid=5992) [QUARK-INFO]: C++ kernel compilation check start.
(EngineCore_DP0 pid=5992)
(EngineCore_DP0 pid=5992) [QUARK-INFO]: C++ kernel build directory: /root/.cache/torch_extensions/py312_cpu/kernel_ext. First-time compilation may take a few minutes... Building for architectures PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151'.
(EngineCore_DP0 pid=5992) Successfully preprocessed all matching files.
(APIServer pid=5927) DEBUG 03-12 00:00:00 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=5927) DEBUG 03-12 00:00:00 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=5927) DEBUG 03-12 00:00:00 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine proc(s) to start.
(EngineCore_DP0 pid=5992)
(EngineCore_DP0 pid=5992) [QUARK-INFO]: C++ kernel compilation is already complete. Ending the C++ kernel compilation check. Total time: 211.4942 seconds
HW Exception by GPU node-1 (Agent handle: 0x30d01ce0) reason :GPU Hang
HW Exception by GPU node-1 (Agent handle: 0x2848ab50) reason :GPU Hang
The test is running on a single MI300X (192 GB) VM. Please let me know if there is anything I could try in order to use this model.
Thanks for the prompt reply.
I got this warning at startup, so I assume it is running the emulated kernel:
(EngineCore_DP0 pid=5992) WARNING 03-12 00:00:00 [model_executor/.../quark/quark_moe.py:441] The current mode (supports_mx=False, use_mxfp4_aiter_moe=True, ocp_mx_scheme=OCP_MX_Scheme.w_mxfp4_a_mxfp4) does not support native MXFP4/MXFP6 computation. Simulated weight dequantization and activation QDQ (quantize and dequantize) will be used, with the linear layers computed in high precision.
and
vLLM environment variables used
-------------------------------------------
VLLM_LOGGING_LEVEL=DEBUG
VLLM_ROCM_USE_AITER_FP4BMM=0
VLLM_DISABLE_COMPILE_CACHE=1
VLLM_USE_TRITON_FLASH_ATTN=0
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
------------------------------------------------------------------
@twinsen123 Yes, it uses emulated QDQ.
We haven’t tested this on MI300, so I’m not sure about the root cause of the issue. If you have multiple GPUs, you could try increasing the TP size, as a single MI300 (192 GB) may not be sufficient for emulative mode. Alternatively, you can launch the vLLM server in eager mode instead of using CUDA Graphs to avoid some potential issues:
vllm serve "$MODEL" \
--tensor-parallel-size 4 \
--trust-remote-code \
--max-model-len 32768 \
--enforce-eager \
--port 8899
No luck. Still getting: memory access fault by GPU node-1 (Agent handle: 0x3801e4a0) on address 0x74cce6200000. Reason: Unknown.
Script for booting up vLLM:
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FP4BMM=0 \
VLLM_DISABLE_COMPILE_CACHE=1 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 \
vllm serve "amd/MiniMax-M2.1-MXFP4" \
--tensor-parallel-size 1 \
--trust-remote-code \
--max-model-len 32768 \
--enforce-eager \
--compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
--port 8899
I think being able to run an optimized quantized model on a single MI300X would boost AMD GPU adoption, since the MI300X is widely available and price-friendly compared to NVIDIA GPUs. Could you check whether there is a roadmap for better MI300X support?
- The GPU hang issue may be due to memory limitations: in emulation mode the model weights are dequantized to BF16, which is four times the size of the MXFP4 format. The pretrained FP8 model is about 230 GB: see https://huggingface.co/MiniMaxAI/MiniMax-M2.1/tree/main.
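As a back-of-envelope check (assuming roughly one byte per parameter for the ~230 GB FP8 checkpoint, half a byte for MXFP4, and two bytes for BF16; these are rounded figures, not exact model sizes):

```python
# Rough parameter count inferred from the ~230 GB FP8 checkpoint at
# ~1 byte/param (an assumption for illustration, not an official figure).
params_billion = 230

mxfp4_gb = params_billion * 0.5  # ~4 bits per weight on disk
bf16_gb = params_billion * 2.0   # dequantized for emulated compute

print(bf16_gb / mxfp4_gb)  # 4.0 — a 4x blow-up
# ~460 GB of BF16 weights is well beyond a single 192 GB MI300X.
```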
- For the MI300X roadmap, would you please refer to this link: https://rocm.docs.amd.com/en/latest/
Thanks @twinsen123 for reporting the issue. The "memory access fault" issue can be reproduced; we need some extra time to fix it.
As a suggestion, please consider using an MI350/MI355, on which the model can be launched successfully.