---
base_model:
- MiniMaxAI/MiniMax-M2.1
---

MiniMax-M2.1 quantized to NVFP4 with NVIDIA TensorRT Model Optimizer (modelopt).

Works fine on 2x and 4x RTX 6000 Pro Blackwell via vLLM (set --tensor-parallel-size to match your GPU count; the command below uses 2).

Sample docker run (you will want to mount your HF cache dir into the container so it doesn't re-download the model on every start; see the sketch below):
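
A minimal sketch of that mount, assuming the default Hugging Face cache location of ~/.cache/huggingface on the host (the vllm/vllm-openai image reads /root/.cache/huggingface inside the container):

```
# Add this flag to the docker run command below:
-v ~/.cache/huggingface:/root/.cache/huggingface \
```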

```
docker run -d \
  --gpus all \
  --ipc host \
  --shm-size 32g \
  --ulimit memlock=-1 \
  --ulimit nofile=1048576 \
  -v /dev/shm:/dev/shm \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e OMP_NUM_THREADS=8 \
  -e SAFETENSORS_FAST_GPU=1 \
  -p 0.0.0.0:8000:8000 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  lukealonso/MiniMax-M2.1-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name "MiniMax-M2.1" \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16
```
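
Once the container is up, the server speaks the standard OpenAI-compatible API on port 8000. A quick smoke test, assuming you kept the flags above (the model name matches --served-model-name):

```
# List the served models, then send a simple chat completion.
curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128
  }'
```

Since the command enables --enable-auto-tool-choice with the minimax_m2 parser, tool calling works through the usual OpenAI tools/tool_calls fields on this same endpoint.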