**Calibration:** 256 samples @ 4096 from Rombo-Org/Optimized_Reasoning

## Notes
As of now I've been unable to get this to run in vLLM. Using a setup similar to the one for the other NVFP4 quant of MiniMax-M2.1 still fails to run. There's probably a trick to getting it working, but I don't yet know what it is; I'll probably open an issue on the vLLM GitHub to ask for guidance. For reference, this is the command I've been trying:

```sh
sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  --model Firworks/MiniMax-M2.1-nvfp4 \
  --dtype auto \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --trust_remote_code \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --all2all-backend pplx \
  --enable-expert-parallel
```
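
If the server does come up, a quick sanity check against vLLM's OpenAI-compatible endpoint might look like the sketch below. The prompt is just an illustration; the port matches the `-p 8000:8000` mapping in the command above.

```sh
# Illustrative sanity check: query the OpenAI-compatible chat endpoint
# on the port mapped in the docker run command above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Firworks/MiniMax-M2.1-nvfp4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```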