
MiniMax M2.1 Model vLLM Deployment Guide

English Version | Chinese Version

We recommend using vLLM to deploy the MiniMax-M2.1 model. vLLM is a high-performance inference engine with excellent serving throughput, efficient and intelligent memory management, powerful batch request processing capabilities, and deeply optimized underlying performance. We recommend reviewing vLLM's official documentation to check hardware compatibility before deployment.

Applicable Models

This document applies to the following models. You only need to change the model name during deployment.

The deployment process is illustrated below using MiniMax-M2.1 as an example.

System Requirements

  • OS: Linux

  • Python: 3.9 - 3.12

  • GPU:

    • Compute capability 7.0 or higher

    • Memory requirements: 220 GB for weights, 240 GB per 1M context tokens

The following are recommended configurations; actual requirements should be adjusted based on your use case:

  • 4x 96GB GPUs: Supported context length of up to 400K tokens.

  • 8x 144GB GPUs: Supported context length of up to 3M tokens.
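The recommended configurations follow from the memory figures above (220 GB for weights plus 240 GB per 1M context tokens). A rough back-of-the-envelope sketch, ignoring activation memory and framework overhead (real deployments need extra headroom beyond this estimate):

```python
# Rough GPU-memory estimate from the figures above:
# 220 GB for weights + 240 GB per 1M context tokens.
# Activation memory, CUDA graphs, and framework overhead are ignored,
# so real deployments need additional headroom.

WEIGHTS_GB = 220
KV_GB_PER_M_TOKENS = 240

def estimated_memory_gb(context_tokens: int) -> float:
    """Estimate total GPU memory (GB) for a given context length in tokens."""
    return WEIGHTS_GB + KV_GB_PER_M_TOKENS * context_tokens / 1_000_000

# 4x 96 GB GPUs (384 GB total) targeting a 400K-token context:
print(estimated_memory_gb(400_000))    # 316.0 GB, fits in 384 GB

# 8x 144 GB GPUs (1152 GB total) targeting a 3M-token context:
print(estimated_memory_gb(3_000_000))  # 940.0 GB, fits in 1152 GB
```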

Deployment with Python

It is recommended to use a virtual environment (such as venv, conda, or uv) to avoid dependency conflicts.

We recommend installing vLLM in a fresh Python environment:

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```
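To confirm that the install succeeded and see exactly which build you got, you can print the installed package version; a quick check (it simply reports "not installed" if vLLM is missing from the environment):

```python
# Print the installed vLLM version to verify the nightly install.
from importlib.metadata import version, PackageNotFoundError

try:
    print(version("vllm"))
except PackageNotFoundError:
    print("vllm is not installed in this environment")
```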

Run the following command to start the vLLM server. vLLM will automatically download and cache the MiniMax-M2.1 model from Hugging Face.

4-GPU deployment command:

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

8-GPU deployment command:

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
    --enable-expert-parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
```

Testing Deployment

After startup, you can test the vLLM OpenAI-compatible API with the following command:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M2.1",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
```
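The same request can also be sent from Python with no extra dependencies; a minimal sketch using only the standard library, assuming the vLLM server from the previous step is listening on localhost:8000:

```python
import json
import urllib.error
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "MiniMaxAI/MiniMax-M2.1",
    "messages": [
        {"role": "system",
         "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user",
         "content": [{"type": "text", "text": "Who won the world series in 2020?"}]},
    ],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
        # The OpenAI-compatible API returns the reply under choices[0].message.
        print(body["choices"][0]["message"]["content"])
except urllib.error.URLError as e:
    print(f"Request failed (is the server running?): {e}")
```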

Common Issues

MiniMax-M2 model is not currently supported

This error means your installed vLLM version is too old. Upgrade to the latest (nightly) version.

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Add `--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"` to the startup parameters to resolve this issue. For example:

```bash
SAFETENSORS_FAST_GPU=1 vllm serve \
    MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
    --enable-expert-parallel --tensor-parallel-size 8 \
    --enable-auto-tool-choice --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"
```

Output is garbled

If you encounter garbled output when serving these models with vLLM, upgrade to a nightly build that includes commit cf3eacfe58fa9e745c2854782ada884a9f992cf7 or later.

Getting Support

If you encounter any issues while deploying the MiniMax model:

  • Contact our technical support team through official channels, such as by email at model@minimax.io

  • Submit an issue on our GitHub repository

We continuously optimize the deployment experience for our models. Feedback is welcome!