MiniMax-M2.1-nvfp4
Format: NVFP4 (W4A4)
Base model: MiniMaxAI/MiniMax-M2.1 (via QuixiAI/MiniMax-M2.1-bf16)
Calibration: 256 samples @ 4096 tokens from Rombo-Org/Optimized_Reasoning
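For anyone curious how a quant like this gets produced, here's a rough sketch using vllm-project/llm-compressor with the calibration settings above. The dataset column name, the `ignore` list, and the exact preprocessing are my assumptions for illustration, not the actual script used for this model:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "QuixiAI/MiniMax-M2.1-bf16"
NUM_SAMPLES = 256
MAX_SEQ_LEN = 4096

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 256 calibration samples at 4096 tokens, matching the settings above.
# Assumption: the dataset exposes a plain-text column named "text";
# adjust to whatever Rombo-Org/Optimized_Reasoning actually uses.
ds = load_dataset("Rombo-Org/Optimized_Reasoning", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LEN, truncation=True),
    remove_columns=ds.column_names,
)

# NVFP4 = 4-bit weights and 4-bit activations (W4A4).
# Keeping lm_head in higher precision is a common default, assumed here.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("MiniMax-M2.1-nvfp4", save_compressed=True)
tokenizer.save_pretrained("MiniMax-M2.1-nvfp4")
```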
Notes
As of now I've been unable to get this to run in vLLM. Using a setup similar to the other NVFP4 quant of MiniMax-M2.1 still fails to start. There's probably a trick to get it running, but I haven't found it yet. I'll probably open an issue on the vLLM GitHub to ask for guidance. For reference, this is the command I've been trying:
```bash
sudo docker run --runtime nvidia --gpus all \
  -p 8000:8000 --ipc=host \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  --model Firworks/MiniMax-M2.1-nvfp4 \
  --dtype auto \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --trust_remote_code \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --all2all-backend pplx \
  --enable-expert-parallel
```
This was tested on a 2 x RTX Pro 6000 Blackwell cloud instance.
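If you do get the server up, it exposes the usual OpenAI-compatible API on port 8000, so a minimal smoke test would look like this (the api_key value is a placeholder; vLLM doesn't check it by default):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; point the client at the container.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Firworks/MiniMax-M2.1-nvfp4",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```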
If there are other models you're interested in seeing quantized to NVFP4 for use on the DGX Spark or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available so more people can try them out.