Firworks committed
Commit e9f1bbb · verified · 1 parent: edf1c2b

Update README.md

Files changed (1): README.md (+4 −1)
README.md CHANGED
````diff
@@ -11,4 +11,7 @@ base_model:
 **Calibration:** 256 samples @ 4096 from Rombo-Org/Optimized_Reasoning
 
 ## Notes
-Validation has not been done yet. Model card will be updated with details about the patching that was required to run this quantization and VLLM commands to run the model and
+As of now I've been unable to get this to run in vLLM. Using a similar setup to the other NVFP4 quant of MiniMax-M2.1 still fails to run. There's probably a trick to get this running, but as of right now I don't know what it is. I'll probably open an issue on the vLLM GitHub to see if they can give some guidance.
+```sh
+sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host -e VLLM_USE_FLASHINFER_MOE_FP4=1 -e NCCL_IB_DISABLE=1 -e NCCL_NVLS_ENABLE=0 -e NCCL_P2P_DISABLE=0 -e NCCL_SHM_DISABLE=0 -e VLLM_USE_V1=1 -e SAFETENSORS_FAST_GPU=1 vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 --model Firworks/MiniMax-M2.1-nvfp4 --dtype auto --max-model-len 32768 --tensor-parallel-size 2 --trust_remote_code --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think --all2all-backend pplx --enable-expert-parallel
+```
````
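If the server does eventually come up, a quick sanity check against vLLM's OpenAI-compatible endpoint would be the natural next step. This is a hedged sketch, not from the commit: it only builds and validates the request body locally, since the author reports the quant does not yet start under vLLM. `localhost:8000` matches the `-p 8000:8000` mapping in the docker command above.

```sh
# Build a chat-completion request body for the server above and verify it is
# valid JSON before sending anything. The actual request (commented out) is
# untested here, because the model reportedly fails to load in vLLM so far.
BODY='{"model":"Firworks/MiniMax-M2.1-nvfp4","messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "request body OK"

# Once the server starts, send it to the standard vLLM OpenAI-compatible route:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$BODY"
```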