**Calibration:** 256 samples @ 4096 from Rombo-Org/Optimized_Reasoning

## Notes
As of now I've been unable to get this to run in vLLM. Using a setup similar to the one for the other NVFP4 quant of MiniMax-M2.1 still fails to run. There's probably a trick to getting it working, but I don't yet know what it is; I'll probably open an issue on the vLLM GitHub to ask for guidance. For reference, this is the command I've been trying:

```sh
sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  --model Firworks/MiniMax-M2.1-nvfp4 \
  --dtype auto \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --trust_remote_code \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --all2all-backend pplx \
  --enable-expert-parallel
```
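
If the server does come up, a quick sanity check against vLLM's OpenAI-compatible endpoint might look like the sketch below. The prompt is just an illustration; the port matches the `-p 8000:8000` mapping in the command above.

```sh
# Illustrative sanity check: query the OpenAI-compatible chat endpoint
# on the port mapped in the docker run command above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Firworks/MiniMax-M2.1-nvfp4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```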