Thanks, thanks and more thanks. Many thanks.
I really appreciate you releasing this. I've been using the M2.5 NVFP4 and it is what I call stupid fast. It makes using any cloud service seem dog-slow by comparison, even the best and most expensive ones (except maybe Grok).
Anyway, this 2.7 seems just a bit slower. Yes, I know I should quantify that with some proof, but I thought I'd ask before interrupting work to run benchmarks. Is 2.7 noticeably slower for you?
I'm running a vllm nightly from Apr 12; GPU setup is 2x RTX PRO 6000 Blackwell.
what recipe are you using to launch?
QuantTrio/MiniMax-M2.5-AWQ - 111t/s
lukealonso/MiniMax-M2.7-NVFP4 - 91t/s
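If you want to quantify the difference without a full benchmark run, timing a single non-streaming request against the OpenAI-compatible endpoint is usually enough. A minimal sketch; the base URL and model name are placeholders for your own deployment:

```python
# Rough decode-throughput check against an OpenAI-compatible vLLM/sglang
# endpoint. base_url and model are placeholders -- adjust for your deployment.
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: reported completion tokens over wall-clock seconds."""
    return completion_tokens / elapsed_s


def measure(base_url: str, model: str, prompt: str, max_tokens: int = 512) -> float:
    """Time one non-streaming chat completion and return tokens/s."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_second(usage["completion_tokens"], time.monotonic() - t0)


# e.g. measure("http://localhost:8000", "minimax-m2", "Count to 200.")
print(tokens_per_second(910, 10.0))  # 91.0
```

The wall clock includes prefill, so keep the prompt short and max_tokens large; at a few hundred output tokens the error versus the server-reported decode throughput is small.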
recipe for both:
vllm serve \
    $modeldir \
    --served-model-name $modelname \
    --dtype auto \
    --max-num-seqs 16 \
    --max-model-len $maxmodellen \
    --gpu-memory-utilization 0.92 \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --enable-chunked-prefill \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
Man, sglang with the docker Luke recommended is the way to go right now.
Service logs (April 13, 2026, 2:13 AM, newest first):
[2026-04-13 06:13:59 TP0] Decode batch, #running-req: 5, #token: 18020, token usage: 0.09, cuda graph: True, gen throughput (token/s): 303.41, #queue-req: 0
[2026-04-13 06:13:59] INFO: 100.71.61.79:62814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-13 06:13:59 TP0] Decode batch, #running-req: 6, #token: 20077, token usage: 0.10, cuda graph: True, gen throughput (token/s): 312.46, #queue-req: 0
[2026-04-13 06:13:58 TP0] Decode batch, #running-req: 6, #token: 19837, token usage: 0.10, cuda graph: True, gen throughput (token/s): 312.24, #queue-req: 0
[2026-04-13 06:13:57 TP0] Decode batch, #running-req: 6, #token: 19597, token usage: 0.10, cuda graph: True, gen throughput (token/s): 316.57, #queue-req: 0
[2026-04-13 06:13:56 TP0] Decode batch, #running-req: 6, #token: 19357, token usage: 0.10, cuda graph: True, gen throughput (token/s): 346.67, #queue-req: 0
[2026-04-13 06:13:56] INFO: 100.71.61.79:62813 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-13 06:13:55 TP0] Decode batch, #running-req: 7, #token: 21028, token usage: 0.11, cuda graph: True, gen throughput (token/s): 348.91, #queue-req: 0
[2026-04-13 06:13:55 TP0] Decode batch, #running-req: 7, #token: 20748, token usage: 0.10, cuda graph: True, gen throughput (token/s): 347.85, #queue-req: 0
[2026-04-13 06:13:54 TP0] Decode batch, #running-req: 7, #token: 20468, token usage: 0.10, cuda graph: True, gen throughput (token/s): 350.01, #queue-req: 0
i'd love to get sglang working, but i can't get the docker image to work for me.
docker run --rm -it \
    --gpus all \
    -v /home/admin/ai/lukealonso_MiniMax-M2.7-NVFP4:/model \
    -e OMP_NUM_THREADS=16 \
    -e SGLANG_ENABLE_SPEC_V2=True \
    -p 8000:8000 \
    voipmonitor/sglang:cu130 \
    python -m sglang.launch_server \
    --model-path /model \
    --served-model-name lukealonso_MiniMax-M2.7-NVFP4 \
    --reasoning-parser minimax \
    --tool-call-parser minimax-m2 \
    --tp 2 \
    --enable-torch-compile \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --moe-runner-backend b12x \
    --fp4-gemm-backend b12x \
    --attention-backend flashinfer \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 \
    --port 8000
[2026-04-13 07:21:56 TP0] Init torch distributed begin.
[2026-04-13 07:21:56 TP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-04-13 07:21:56 TP1] Init torch distributed begin.
[2026-04-13 07:21:56] Fixing v5 tokenizer component mismatch for /model: pre_tokenizer ByteLevel -> Sequence, decoder ByteLevel -> ByteLevel
[1/2] /usr/local/cuda/bin/nvcc -MD -MF pcie_allreduce.cuda.o.d -DTORCH_EXTENSION_NAME=pcie_allreduce_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/venv/lib/python3.12/site-packages/torch/include -isystem /opt/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120a,code=sm_120a --compiler-options '-fPIC' -O2 --expt-relaxed-constexpr -std=c++17 -c /opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu -o pcie_allreduce.cuda.o
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu: In destructor ‘pcie_allreduce::PCIeAllreduce::~PCIeAllreduce()’:
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu:525:355: warning: ‘throw’ will always call ‘terminate’ [-Wterminate]
525 | for (auto [_, ptr] : ipc_handles) CHECK_CUDA_SUCCESS(cudaIpcCloseMemHandle(ptr));
| ^
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu:525:355: note: in C++11 destructors default to ‘noexcept’
[2/2] c++ pcie_allreduce.cuda.o -shared -lcuda -L/opt/venv/lib/python3.12/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o pcie_allreduce_ext.so
[2026-04-13 07:23:00 TP0] sglang is using nccl==2.29.7
4fd6053ab362:113:113 [0] NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
4fd6053ab362:113:113 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO cudaDriverVersion 13000
4fd6053ab362:113:113 [0] NCCL INFO NCCL version 2.29.7+cuda13.2
4fd6053ab362:113:113 [0] NCCL INFO NCCL git version stable b81d6a5a3
4fd6053ab362:114:114 [1] NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
4fd6053ab362:114:114 [1] NCCL INFO cudaDriverVersion 13000
4fd6053ab362:114:114 [1] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO NCCL version 2.29.7+cuda13.2
4fd6053ab362:114:114 [1] NCCL INFO NCCL git version stable b81d6a5a3
4fd6053ab362:114:114 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so
4fd6053ab362:114:114 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
4fd6053ab362:114:114 [1] NCCL INFO Failed to open libmlx5.so[.1]
4fd6053ab362:114:114 [1] NCCL INFO NET/IB : No device found.
4fd6053ab362:114:114 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO Failed to initialize NET plugin IB
4fd6053ab362:114:114 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO Initialized NET plugin Socket
4fd6053ab362:114:114 [1] NCCL INFO Assigned NET plugin Socket to comm
4fd6053ab362:114:114 [1] NCCL INFO GIN/Plugin: Could not find: libnccl-gin.so
4fd6053ab362:114:114 [1] NCCL INFO Failed to initialize any GIN plugin
4fd6053ab362:114:114 [1] NCCL INFO Using network Socket
4fd6053ab362:114:114 [1] NCCL INFO [Rank 1] ncclCommInitRank comm 0x38c31510 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0x504a6b25b462efc3 - Init START
4fd6053ab362:113:113 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so
4fd6053ab362:113:113 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO Failed to open libmlx5.so[.1]
4fd6053ab362:113:113 [0] NCCL INFO NET/IB : No device found.
4fd6053ab362:113:113 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO Failed to initialize NET plugin IB
4fd6053ab362:113:113 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO Initialized NET plugin Socket
4fd6053ab362:113:113 [0] NCCL INFO Assigned NET plugin Socket to comm
4fd6053ab362:113:113 [0] NCCL INFO GIN/Plugin: Could not find: libnccl-gin.so
4fd6053ab362:113:113 [0] NCCL INFO Failed to initialize any GIN plugin
4fd6053ab362:113:113 [0] NCCL INFO Using network Socket
4fd6053ab362:113:113 [0] NCCL INFO [Rank 0] ncclCommInitRank comm 0x37f04f40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 46000 commId 0x504a6b25b462efc3 - Init START
4fd6053ab362:113:113 [0] NCCL INFO RAS client listening socket at ::1<28028>
4fd6053ab362:114:114 [1] NCCL INFO RAS client listening socket at ::1<28028>
4fd6053ab362:114:114 [1] NCCL INFO Bootstrap timings total 0.003376 (create 0.000041, send 0.000142, recv 0.002518, ring 0.0000447, delay 0.000002)
4fd6053ab362:113:113 [0] NCCL INFO Bootstrap timings total 0.001124 (create 0.000040, send 0.000131, recv 0.000444, ring 0.000041, delay 0.000001)
4fd6053ab362:114:114 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO ncclTopoGetCpuAffinity: Affinity for GPU 0 is empty, ignoring. (GPU affinity = ; CPU affinity = 0-47).
4fd6053ab362:113:113 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
4fd6053ab362:114:114 [1] NCCL INFO ncclTopoGetCpuAffinity: Affinity for GPU 1 is empty, ignoring. (GPU affinity = ; CPU affinity = 0-47).
4fd6053ab362:114:114 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO comm 0x37f04f40 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
4fd6053ab362:114:114 [1] NCCL INFO comm 0x38c31510 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
4fd6053ab362:113:113 [0] NCCL INFO Channel 00/02 : 0 1
4fd6053ab362:113:113 [0] NCCL INFO Channel 01/02 : 0 1
4fd6053ab362:114:114 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
4fd6053ab362:113:113 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
4fd6053ab362:114:114 [1] NCCL INFO P2P Chunksize set to 131072
4fd6053ab362:113:113 [0] NCCL INFO P2P Chunksize set to 131072
4fd6053ab362:113:113 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so
4fd6053ab362:114:114 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so
4fd6053ab362:113:113 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 isAllCudaP2p 1
4fd6053ab362:114:114 [1] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 isAllCudaP2p 1
4fd6053ab362:114:387 [0] NCCL INFO [Proxy Service] Device 1 CPU core 21
4fd6053ab362:114:389 [0] NCCL INFO [Proxy Service UDS] Device 1 CPU core 30
4fd6053ab362:113:388 [0] NCCL INFO [Proxy Service] Device 0 CPU core 10
4fd6053ab362:113:390 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 29
4fd6053ab362:114:114 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
4fd6053ab362:114:114 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
4fd6053ab362:114:114 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
4fd6053ab362:113:113 [0] NCCL INFO Connected all trees
4fd6053ab362:114:114 [1] NCCL INFO Connected all trees
[2026-04-13 07:23:00] 4fd6053ab362:114:387 [1] misc/shmutils.cc:88 NCCL WARN Error: failed to extend /dev/shm/nccl-QDYemn to 34210180 bytes, error: No space left on device (28)
[2026-04-13 07:23:00] 4fd6053ab362:114:387 [1] misc/shmutils.cc:133 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-QDYemn (size 34210176), error: No space left on device (28)
4fd6053ab362:114:387 [1] NCCL INFO proxy.cc:1393 -> 2
4fd6053ab362:114:387 [1] NCCL INFO proxy.cc:1451 -> 2
4fd6053ab362:114:114 [1] NCCL INFO proxy.cc:1166 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:1400 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:1707 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:2225 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:2252 -> 2
[2026-04-13 07:23:00] 4fd6053ab362:113:388 [0] misc/shmutils.cc:88 NCCL WARN Error: failed to extend /dev/shm/nccl-JSqlX5 to 34210180 bytes, error: No space left on device (28)
[2026-04-13 07:23:00] 4fd6053ab362:113:388 [0] misc/shmutils.cc:133 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-JSqlX5 (size 34210176), error: No space left on device (28)
4fd6053ab362:113:388 [0] NCCL INFO proxy.cc:1393 -> 2
4fd6053ab362:113:388 [0] NCCL INFO proxy.cc:1451 -> 2
4fd6053ab362:113:113 [0] NCCL INFO proxy.cc:1166 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:1400 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:1707 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:2225 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:2252 -> 2
[2026-04-13 07:23:00 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 3597, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 386, in __init__
self.init_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 630, in init_model_worker
self.init_tp_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 598, in init_tp_model_worker
self.tp_worker = TpModelWorker(**worker_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
self._init_model_runner()
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 402, in __init__
pre_model_load_memory = self.init_torch_distributed()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 956, in init_torch_distributed
initialize_model_parallel(
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1859, in initialize_model_parallel
_TP = init_model_parallel_group(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1480, in init_model_parallel_group
return GroupCoordinator(
^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 358, in __init__
self.pynccl_comm = PyNcclCommunicator(
^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 113, in __init__
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 401, in ncclCommInitRank
self.NCCL_CHECK(
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 376, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
[2026-04-13 07:23:00] Received sigquit from a child process. It usually means the child failed.
[2026-04-13 07:23:00 TP0] Scheduler hit an exception: Traceback (most recent call last):
(identical traceback to TP1 above)
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
[2026-04-13 07:23:00] Received sigquit from a child process. It usually means the child failed.
300+ tokens/s @atrix !!? i'm going to have to look into the sglang setup. i'm getting nowhere near that with vllm. i'm currently starting vllm with
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/data/models/MiniMax-M2.7-NVFP4 \
--host 0.0.0.0 \
--port 1235 \
--served-model-name minimax-m2 \
--trust-remote-code \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 147456 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
> i'd love to get sglang working, but i can't get the docker image to work for me.
It looks like you're missing --shm-size; docker defaults to 64 MB of shared memory, which isn't enough. NCCL needs much more than that.
try adding:
--shm-size 32g
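The numbers line up with the failure in the log: NCCL tried to grow each /dev/shm proxy segment to ~32.6 MiB, and with two TP ranks that alone exceeds Docker's 64 MiB default. A quick sanity check of the arithmetic, using the sizes from the log above:

```python
# Sizes from the NCCL warning above; Docker's default --shm-size is 64 MiB.
segment = 34_210_180          # bytes NCCL failed to extend /dev/shm/nccl-* to
default_shm = 64 * 1024**2    # Docker's default shared-memory size

print(round(segment / 2**20, 1))   # per-rank segment in MiB: 32.6
# Two TP ranks each need at least one such segment, so the default can't hold them:
print(2 * segment > default_shm)   # True
```

Hence the "No space left on device" on /dev/shm even with plenty of disk and RAM free; 32g is generous but shared memory is only consumed as mapped.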
Alternatively, here's my docker-compose.yml if you want the easy button (assuming you've installed Docker, the NVIDIA Container Toolkit, all that). Note that I'm using fp8_e4m3 for the KV cache right now, testing to fit a bit more context on my 2x RTX 6000s, but you can also use --kv-cache-dtype bf16 if you prefer it or have better hardware.
docker-compose.yml:
services:
  minimax:
    image: voipmonitor/sglang:cu130
    container_name: minimax-m27
    shm_size: 32g
    ports:
      - "8001:5000"
    volumes:
      - ~/LLM:/models
    environment:
      - OMP_NUM_THREADS=16
      - SGLANG_ENABLE_SPEC_V2=True
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    command: >
      python -m sglang.launch_server
      --model-path /models/minimax_m27_nvfp4
      --served-model-name MiniMax-M2.7
      --reasoning-parser minimax
      --tool-call-parser minimax-m2
      --tp 2
      --enable-torch-compile
      --trust-remote-code
      --kv-cache-dtype fp8_e4m3
      --quantization modelopt_fp4
      --moe-runner-backend b12x
      --fp4-gemm-backend b12x
      --attention-backend flashinfer
      --enable-pcie-oneshot-allreduce
      --mem-fraction-static 0.93
      --host 0.0.0.0 --port 5000
MiniMax 2.7 has analyzed its own docker logs. Over a few hours of coding-agent use (OpenCode, mostly Pi), I'm seeing 50-70 tokens/s generation. Seems a bit slower than 2.5 on the same hardware, same launch command.
vLLM Container Report (MiniMax-M2.7-NVFP4)
Report Date: 2026-04-13
Container: vllm
1. Infrastructure & Hardware
| Property | Value |
|---|---|
| GPU | 3× NVIDIA RTX PRO 6000 Blackwell Workstation Edition |
| GPU Memory (each) | 95,787 MiB (~98 GB) |
| Compute Capability | 12.0 (Blackwell arch) |
| CUDA Version | 12.9.1 |
| Total GPU Memory | ~294 GB across 3 GPUs |
| Container Memory (ShmSize) | 16 GB |
| Runtime | NVIDIA (nvidia-container-runtime) |
Note: Only GPUs 0 and 1 are used for inference (CUDA_VISIBLE_DEVICES=0,1). The third GPU appears to be unused or reserved.
2. Model Configuration
| Parameter | Value |
|---|---|
| Model | /mnt/data/models/MiniMax-M2.7-NVFP4 |
| Architecture | MiniMaxM2ForCausalLM |
| Checkpoint Size | 125.19 GiB |
| Checkpoint Format | NVFP4 (ModelOpt, experimental) |
| Max Sequence Length | 147,456 tokens |
| dtype | torch.bfloat16 |
| Quantization | modelopt_fp4 |
| Tokenizer | /mnt/data/models/MiniMax-M2.7-NVFP4 |
| Trust Remote Code | true |
| HuggingFace Cache | /mnt/data/nvme0n1/models/huggingface |
3. vLLM Server Configuration (from start-vllm)
#!/usr/bin/bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/data/models/MiniMax-M2.7-NVFP4 \
--host 0.0.0.0 --port 1235 \
--served-model-name minimax-m2 \
--trust-remote-code \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 147456 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
4. Key Configuration Parameters Explained
4.1 Parallelism
| Parameter | Value | Description |
|---|---|---|
| tensor-parallel-size | 2 | Model sharded across 2 GPUs via tensor parallelism |
| pipeline-parallel-size | 1 | Default |
| data-parallel-size | 1 | Default |
| VLLM_WORKER_MULTIPROC_METHOD | spawn | Process spawn method for workers |
4.2 Memory Management
| Parameter | Value | Description |
|---|---|---|
| gpu-memory-utilization | 0.95 | 95% of GPU memory allocated for model weights and KV cache |
| max-model-len | 147,456 | Long context window (256K tokens max allowed) |
| max-num-batched-tokens | 16,384 | Max tokens processed in a single forward pass |
| max-num-seqs | 64 | Max number of concurrent sequences in a batch |
| disable-custom-all-reduce | true | Disables vLLM's custom all-reduce kernel, falling back to NCCL (P2P is disabled) |
4.3 Performance Optimizations
| Parameter | Description |
|---|---|
| enable-chunked-prefill=True | Splits large prefill requests into chunks to reduce memory pressure |
| enable-prefix-caching=True | Caches KV caches for repeated prompt prefixes |
| cudagraph_mode=FULL_AND_PIECEWISE | CUDA graph capturing for both full-sequence and piecewise (mixed prefill-decode) workloads |
| fuse_act_quant=True | Fuses activation quantization for FP4 GEMM |
| NVFP4 GEMM backend | CUTLASS (not MARLIN) |
| FlashAttention backend | FLASH_ATTN v2 |
| NVFP4 MoE backend | VLLM_CUTLASS |
| FlashInfer autotune | Enabled (completed successfully) |
4.4 Tool & Reasoning Support
| Parameter | Value |
|---|---|
| enable-auto-tool-choice | true |
| tool-call-parser | minimax_m2 |
| reasoning-parser | minimax_m2_append_think |
| served-model-name | minimax-m2 |
5. Startup & Initialization Timeline
| Phase | Duration |
|---|---|
| Model loading (26 safetensor shards) | ~40 seconds |
| Torch compile (AOT) | ~72 seconds |
| Dynamo bytecode transform | ~12.5 seconds |
| Graph compilation (1-16384 range) | ~18 seconds |
| CUDA graph capturing (PIECEWISE: 19 graphs, FULL: 11 graphs) | ~8 seconds |
| Total initialization time | ~176 seconds |
CUDA Graph Memory
| Metric | Value |
|---|---|
| Estimated | 1.33 GiB per GPU |
| Actual | 1.20 GiB per GPU |
| Difference | 0.13 GiB (10.7% overestimation) |
KV Cache
| Metric | Value |
|---|---|
| Available KV cache memory | 23.08 GiB per GPU (46.16 GiB total across 2 GPUs) |
| GPU KV cache size | 195,136 tokens (total across both GPUs) |
| Maximum concurrency | 1.32x for 147,456-token requests |
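The concurrency figure follows directly from the cache capacity in the same table; a quick check of the arithmetic:

```python
# KV cache capacity vs. configured max context, from the table above.
kv_cache_tokens = 195_136     # GPU KV cache size, total across both GPUs
max_model_len = 147_456       # configured --max-model-len

concurrency = kv_cache_tokens / max_model_len
print(f"{concurrency:.2f}x")  # 1.32x max-length requests fit in cache at once
```

In practice requests rarely use the full context, so real concurrency is much higher; this number is the worst case.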
6. Detailed Performance Metrics
6.1 Token Throughput Summary
Prompt (Prefill) Throughput
| Scenario | Tokens/Second |
|---|---|
| Idle / No requests | 0.0 |
| Low-load (light requests) | 7–50 |
| Medium-load (moderate batching) | 50–350 |
| High-load (batch-heavy prefill) | 350–1,500 |
| Peak observed (cache hit) | 6,183 tokens/s |
| Peak observed (long prompt) | 5,967 tokens/s |
| Very long prompt burst | 5,609 tokens/s |
Generation (Decode) Throughput
| Scenario | Tokens/Second |
|---|---|
| Idle / No requests | 0.0 |
| Low-load (short response) | 4–15 |
| Medium-load (typical generation) | 15–40 |
| High-load (long streaming generation) | 40–58 |
| Peak observed | 75.1 tokens/s |
6.2 Performance Over Time (Notable Snapshots)
| Timestamp | Prompt (tok/s) | Gen (tok/s) | Running | GPU KV % | Prefix Cache % |
|---|---|---|---|---|---|
| 08:18:30 | 110.5 | 20.8 | 0 | 0.0% | 94.8% |
| 08:19:50 | 13.9 | 4.2 | 1 | 16.3% | 94.8% |
| 08:34:30 | 5,967.0 | 7.1 | 0 | 0.0% | 94.7% |
| 08:40:50 | 59.0 | 36.6 | 1 | 32.7% | 94.9% |
| 08:50:30 | 0.0 | 75.1 | 1 | 8.2% | 95.4% |
| 08:51:50 | 14.3 | 5.3 | 1 | 25.6% | 95.3% |
| 08:52:10 | 63.0 | 42.4 | 1 | 26.0% | 95.3% |
| 08:53:40 | 191.4 | 51.1 | 1 | 19.0% | 95.3% |
| 12:08:51 | 0.0 | 56.5 | 1 | 29.2% | 95.0% |
| 12:09:01 | 0.0 | 56.0 | 1 | 29.5% | 95.0% |
| 12:09:11 | 0.0 | 55.6 | 1 | 29.7% | 95.0% |
| 12:09:21 | 0.0 | 55.1 | 1 | 30.0% | 95.0% |
| 12:12:21 | 0.0 | 55.3 | 1 | 30.8% | 95.0% |
| 12:12:31 | 0.0 | 54.8 | 1 | 31.1% | 95.0% |
6.3 Key Performance Observations
Generation Throughput Stability: Under sustained single-request generation loads, the model consistently achieves 55–75 tokens/second.
High Prefix Cache Hit Rate: The prefix cache hit rate stabilizes at ~95% after warmup, indicating excellent cache efficiency for repeated system prompts or conversation prefixes.
GPU KV Cache Usage: During active requests, GPU KV cache usage reaches ~30–38%, well within the available 46 GiB total.
Asynchronous Scheduling: Enabled, allowing better GPU utilization by overlapping prefill/decode operations.
Chunked Prefill: With max_num_batched_tokens=16384, large prompts are chunked to avoid memory spikes.
Peak Prompt Throughput: With prefix cache hits (94–95%), prompt processing bursts to 5,000–6,000+ tokens/second.
SymmMemCommunicator Warning: Device capability 12.0 is not supported by SymmMemCommunicator — this is expected for Blackwell (compute 12.0) as noted in the warning.
TensorFloat32 Warning: TensorFloat32 tensor cores are available but not enabled. This could be enabled for faster FP32 matmul if the accuracy tradeoff is acceptable: torch.set_float32_matmul_precision('high').
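The chunked-prefill observation above is easy to make concrete: the number of prefill forward passes for a long prompt is just the prompt length divided by the chunk budget, rounded up. A sketch with this report's configured values:

```python
import math

# Chunked prefill: a long prompt is processed in fixed-size token chunks
# rather than one giant forward pass.
max_num_batched_tokens = 16_384   # chunk budget per forward pass
prompt_len = 147_456              # the configured max-model-len

chunks = math.ceil(prompt_len / max_num_batched_tokens)
print(chunks)  # 9 -- a max-length prompt takes exactly 9 full-size passes
```

This is why prefill throughput scales with batch pressure while peak KV memory stays bounded by the chunk budget.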
7. Network & API
| Property | Value |
|---|---|
| API Host | 0.0.0.0:1235 |
| Exposed Port | 1235/tcp → Host 8080 |
| API Protocol | OpenAI-compatible REST |
| Endpoints | /v1/chat/completions, /v1/completions, /v1/models, /health, /metrics, etc. |
| Client IPs | 172.20.3.171, 172.20.3.167 |
| Default Sampling Params | temperature=1.0, top_k=40, top_p=0.95 (overridden from generation_config.json) |
8. Summary
| Metric | Value |
|---|---|
| Model | MiniMax-M2.7-NVFP4 (125 GB checkpoint, NVFP4 quantized) |
| Serving Stack | vLLM 0.19.1rc1 nightly |
| Hardware | 2× NVIDIA RTX PRO 6000 Blackwell (98 GB each) |
| Tensor Parallelism | 2 |
| Max Context | 147,456 tokens |
| KV Cache Capacity | 195,136 tokens across 2 GPUs |
| Peak Prefill Throughput | ~6,183 tokens/s (with prefix cache) |
| Sustained Generation Throughput | 55–75 tokens/s |
| Prefix Cache Hit Rate (warm) | ~95% |
| CUDA Graph | Full + Piecewise (19 piecewise + 11 full graphs) |
| Attention Backend | FlashAttention v2 |
| MoE GEMM Backend | CUTLASS |
| Startup Time | ~176 seconds |
| Tool Support | Auto tool choice via minimax_m2 parser |
| Reasoning Parser | minimax_m2_append_think |
@dareposte i've tried the recipe you provided with dual RTX PRO 6000. For me it was faster than vllm under every scenario, as far as I could tell. In some cases it seemed to cook over TWICE as fast. I've only used it for a couple hours, but barring any major issues, this will be my daily driver. SO. FREAKING. FAST!