---
base_model:
- MiniMaxAI/MiniMax-M2.1
---

MiniMax-M2.1 quantized to NVFP4 with NVIDIA TensorRT Model Optimizer (modelopt).

Works fine on 2x and 4x RTX 6000 Pro Blackwell via vLLM (set --tensor-parallel-size to match your GPU count; the command below uses 2).

Sample docker run (you will want to mount your HF cache dir into the container so it doesn't re-download the model on every start; see the sketch below):
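
A minimal sketch of that mount, assuming the default Hugging Face cache location of ~/.cache/huggingface on the host (the vllm/vllm-openai image reads /root/.cache/huggingface inside the container):

```
# Add this flag to the docker run command below:
-v ~/.cache/huggingface:/root/.cache/huggingface \
```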

```
docker run -d \
  --gpus all \
  --ipc host \
  --shm-size 32g \
  --ulimit memlock=-1 \
  --ulimit nofile=1048576 \
  -v /dev/shm:/dev/shm \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e OMP_NUM_THREADS=8 \
  -e SAFETENSORS_FAST_GPU=1 \
  -p 0.0.0.0:8000:8000 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  lukealonso/MiniMax-M2.1-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name "MiniMax-M2.1" \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16
```
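
Once the container is up, the server speaks the standard OpenAI-compatible API on port 8000. A quick smoke test, assuming you kept the flags above (the model name matches --served-model-name):

```
# List the served models, then send a simple chat completion.
curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128
  }'
```

Since the command enables --enable-auto-tool-choice with the minimax_m2 parser, tool calling works through the usual OpenAI tools/tool_calls fields on this same endpoint.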