Thanks, thanks and more thanks. Many thanks.

#2
by aaron-newsome - opened

I really appreciate you releasing this. I've been using the NVIDIA M2.5 NVFP4 and it is what I call stupid fast. It makes using any cloud service seem dog slow by comparison, even the best and most expensive ones (except maybe Grok).

Anyway, this 2.7 seems just a bit slower. Yes, I know I should provide some proof or quantify it, but I thought I'd ask before interrupting work to run benchmarks. Is 2.7 noticeably slower for you?

I'm running the vllm nightly from Apr 12; GPU setup is 2x RTX PRO 6000 Blackwell.

what recipe are you using to launch?

QuantTrio/MiniMax-M2.5-AWQ - 111t/s
lukealonso/MiniMax-M2.7-NVFP4 - 91t/s
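For what it's worth, the gap between those two reported numbers works out like this (a quick sanity check on the figures above, not a benchmark):

```python
# Rough comparison of the two decode speeds reported in this thread.
awq_tps = 111.0   # QuantTrio/MiniMax-M2.5-AWQ
nvfp4_tps = 91.0  # lukealonso/MiniMax-M2.7-NVFP4

slowdown = 1.0 - nvfp4_tps / awq_tps
print(f"NVFP4 build is ~{slowdown:.0%} slower at decode")  # ~18%
```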

recipe for both:

vllm serve \
  $modeldir \
  --served-model-name $modelname \
  --dtype auto \
  --max-num-seqs 16 \
  --max-model-len $maxmodellen \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

Man, sglang with the Docker image Luke recommended is the way to go right now.

Service logs (April 13, 2026, ~2:13 AM):

[2026-04-13 06:13:59 TP0] Decode batch, #running-req: 5, #token: 18020, token usage: 0.09, cuda graph: True, gen throughput (token/s): 303.41, #queue-req: 0
[2026-04-13 06:13:59] INFO: 100.71.61.79:62814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-13 06:13:59 TP0] Decode batch, #running-req: 6, #token: 20077, token usage: 0.10, cuda graph: True, gen throughput (token/s): 312.46, #queue-req: 0
[2026-04-13 06:13:58 TP0] Decode batch, #running-req: 6, #token: 19837, token usage: 0.10, cuda graph: True, gen throughput (token/s): 312.24, #queue-req: 0
[2026-04-13 06:13:57 TP0] Decode batch, #running-req: 6, #token: 19597, token usage: 0.10, cuda graph: True, gen throughput (token/s): 316.57, #queue-req: 0
[2026-04-13 06:13:56 TP0] Decode batch, #running-req: 6, #token: 19357, token usage: 0.10, cuda graph: True, gen throughput (token/s): 346.67, #queue-req: 0
[2026-04-13 06:13:56] INFO: 100.71.61.79:62813 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-13 06:13:55 TP0] Decode batch, #running-req: 7, #token: 21028, token usage: 0.11, cuda graph: True, gen throughput (token/s): 348.91, #queue-req: 0
[2026-04-13 06:13:55 TP0] Decode batch, #running-req: 7, #token: 20748, token usage: 0.10, cuda graph: True, gen throughput (token/s): 347.85, #queue-req: 0
[2026-04-13 06:13:54 TP0] Decode batch, #running-req: 7, #token: 20468, token usage: 0.10, cuda graph: True, gen throughput (token/s): 350.01, #queue-req: 0

i'd love to get sglang working, but i can't get the docker image to work for me.

docker run --rm -it \
  --gpus all \
  -v /home/admin/ai/lukealonso_MiniMax-M2.7-NVFP4:/model \
  -e OMP_NUM_THREADS=16 \
  -e SGLANG_ENABLE_SPEC_V2=True \
  -p 8000:8000 \
  voipmonitor/sglang:cu130 \
  python -m sglang.launch_server \
    --model-path /model \
    --served-model-name lukealonso_MiniMax-M2.7-NVFP4 \
    --reasoning-parser minimax \
    --tool-call-parser minimax-m2 \
    --tp 2 \
    --enable-torch-compile \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --moe-runner-backend b12x \
    --fp4-gemm-backend b12x \
    --attention-backend flashinfer \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 \
    --port 8000

[2026-04-13 07:21:56 TP0] Init torch distributed begin.
[2026-04-13 07:21:56 TP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-04-13 07:21:56 TP1] Init torch distributed begin.
[2026-04-13 07:21:56] Fixing v5 tokenizer component mismatch for /model: pre_tokenizer ByteLevel -> Sequence, decoder ByteLevel -> ByteLevel
[1/2] /usr/local/cuda/bin/nvcc -MD -MF pcie_allreduce.cuda.o.d -DTORCH_EXTENSION_NAME=pcie_allreduce_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/venv/lib/python3.12/site-packages/torch/include -isystem /opt/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120a,code=sm_120a --compiler-options '-fPIC' -O2 --expt-relaxed-constexpr -std=c++17 -c /opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu -o pcie_allreduce.cuda.o
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu: In destructor ‘pcie_allreduce::PCIeAllreduce::~PCIeAllreduce()’:
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu:525:355: warning: ‘throw’ will always call ‘terminate’ [-Wterminate]
525 | for (auto [_, ptr] : ipc_handles) CHECK_CUDA_SUCCESS(cudaIpcCloseMemHandle(ptr));
| ^
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu:525:355: note: in C++11 destructors default to ‘noexcept’
[2/2] c++ pcie_allreduce.cuda.o -shared -lcuda -L/opt/venv/lib/python3.12/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o pcie_allreduce_ext.so
[2026-04-13 07:23:00 TP0] sglang is using nccl==2.29.7
4fd6053ab362:113:113 [0] NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
4fd6053ab362:113:113 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO cudaDriverVersion 13000
4fd6053ab362:113:113 [0] NCCL INFO NCCL version 2.29.7+cuda13.2
4fd6053ab362:113:113 [0] NCCL INFO NCCL git version stable b81d6a5a3
4fd6053ab362:114:114 [1] NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
4fd6053ab362:114:114 [1] NCCL INFO cudaDriverVersion 13000
4fd6053ab362:114:114 [1] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO NCCL version 2.29.7+cuda13.2
4fd6053ab362:114:114 [1] NCCL INFO NCCL git version stable b81d6a5a3
4fd6053ab362:114:114 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so
4fd6053ab362:114:114 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
4fd6053ab362:114:114 [1] NCCL INFO Failed to open libmlx5.so[.1]
4fd6053ab362:114:114 [1] NCCL INFO NET/IB : No device found.
4fd6053ab362:114:114 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO Failed to initialize NET plugin IB
4fd6053ab362:114:114 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO Initialized NET plugin Socket
4fd6053ab362:114:114 [1] NCCL INFO Assigned NET plugin Socket to comm
4fd6053ab362:114:114 [1] NCCL INFO GIN/Plugin: Could not find: libnccl-gin.so
4fd6053ab362:114:114 [1] NCCL INFO Failed to initialize any GIN plugin
4fd6053ab362:114:114 [1] NCCL INFO Using network Socket
4fd6053ab362:114:114 [1] NCCL INFO [Rank 1] ncclCommInitRank comm 0x38c31510 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0x504a6b25b462efc3 - Init START
4fd6053ab362:113:113 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so
4fd6053ab362:113:113 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO Failed to open libmlx5.so[.1]
4fd6053ab362:113:113 [0] NCCL INFO NET/IB : No device found.
4fd6053ab362:113:113 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO Failed to initialize NET plugin IB
4fd6053ab362:113:113 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO Initialized NET plugin Socket
4fd6053ab362:113:113 [0] NCCL INFO Assigned NET plugin Socket to comm
4fd6053ab362:113:113 [0] NCCL INFO GIN/Plugin: Could not find: libnccl-gin.so
4fd6053ab362:113:113 [0] NCCL INFO Failed to initialize any GIN plugin
4fd6053ab362:113:113 [0] NCCL INFO Using network Socket
4fd6053ab362:113:113 [0] NCCL INFO [Rank 0] ncclCommInitRank comm 0x37f04f40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 46000 commId 0x504a6b25b462efc3 - Init START
4fd6053ab362:113:113 [0] NCCL INFO RAS client listening socket at ::1<28028>
4fd6053ab362:114:114 [1] NCCL INFO RAS client listening socket at ::1<28028>
4fd6053ab362:114:114 [1] NCCL INFO Bootstrap timings total 0.003376 (create 0.000041, send 0.000142, recv 0.002518, ring 0.0000447, delay 0.000002)
4fd6053ab362:113:113 [0] NCCL INFO Bootstrap timings total 0.001124 (create 0.000040, send 0.000131, recv 0.000444, ring 0.000041, delay 0.000001)
4fd6053ab362:114:114 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO ncclTopoGetCpuAffinity: Affinity for GPU 0 is empty, ignoring. (GPU affinity = ; CPU affinity = 0-47).
4fd6053ab362:113:113 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
4fd6053ab362:114:114 [1] NCCL INFO ncclTopoGetCpuAffinity: Affinity for GPU 1 is empty, ignoring. (GPU affinity = ; CPU affinity = 0-47).
4fd6053ab362:114:114 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO comm 0x37f04f40 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
4fd6053ab362:114:114 [1] NCCL INFO comm 0x38c31510 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
4fd6053ab362:113:113 [0] NCCL INFO Channel 00/02 : 0 1
4fd6053ab362:113:113 [0] NCCL INFO Channel 01/02 : 0 1
4fd6053ab362:114:114 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
4fd6053ab362:113:113 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
4fd6053ab362:114:114 [1] NCCL INFO P2P Chunksize set to 131072
4fd6053ab362:113:113 [0] NCCL INFO P2P Chunksize set to 131072
4fd6053ab362:113:113 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so
4fd6053ab362:114:114 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so
4fd6053ab362:113:113 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 isAllCudaP2p 1
4fd6053ab362:114:114 [1] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 isAllCudaP2p 1
4fd6053ab362:114:387 [0] NCCL INFO [Proxy Service] Device 1 CPU core 21
4fd6053ab362:114:389 [0] NCCL INFO [Proxy Service UDS] Device 1 CPU core 30
4fd6053ab362:113:388 [0] NCCL INFO [Proxy Service] Device 0 CPU core 10
4fd6053ab362:113:390 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 29
4fd6053ab362:114:114 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
4fd6053ab362:114:114 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
4fd6053ab362:114:114 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
4fd6053ab362:113:113 [0] NCCL INFO Connected all trees
4fd6053ab362:114:114 [1] NCCL INFO Connected all trees

[2026-04-13 07:23:00] 4fd6053ab362:114:387 [1] misc/shmutils.cc:88 NCCL WARN Error: failed to extend /dev/shm/nccl-QDYemn to 34210180 bytes, error: No space left on device (28)

[2026-04-13 07:23:00] 4fd6053ab362:114:387 [1] misc/shmutils.cc:133 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-QDYemn (size 34210176), error: No space left on device (28)
4fd6053ab362:114:387 [1] NCCL INFO proxy.cc:1393 -> 2
4fd6053ab362:114:387 [1] NCCL INFO proxy.cc:1451 -> 2
4fd6053ab362:114:114 [1] NCCL INFO proxy.cc:1166 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:1400 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:1707 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:2225 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:2252 -> 2

[2026-04-13 07:23:00] 4fd6053ab362:113:388 [0] misc/shmutils.cc:88 NCCL WARN Error: failed to extend /dev/shm/nccl-JSqlX5 to 34210180 bytes, error: No space left on device (28)

[2026-04-13 07:23:00] 4fd6053ab362:113:388 [0] misc/shmutils.cc:133 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-JSqlX5 (size 34210176), error: No space left on device (28)
4fd6053ab362:113:388 [0] NCCL INFO proxy.cc:1393 -> 2
4fd6053ab362:113:388 [0] NCCL INFO proxy.cc:1451 -> 2
4fd6053ab362:113:113 [0] NCCL INFO proxy.cc:1166 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:1400 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:1707 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:2225 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:2252 -> 2
[2026-04-13 07:23:00 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 3597, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 386, in __init__
self.init_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 630, in init_model_worker
self.init_tp_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 598, in init_tp_model_worker
self.tp_worker = TpModelWorker(**worker_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
self._init_model_runner()
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 402, in __init__
pre_model_load_memory = self.init_torch_distributed()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 956, in init_torch_distributed
initialize_model_parallel(
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1859, in initialize_model_parallel
_TP = init_model_parallel_group(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1480, in init_model_parallel_group
return GroupCoordinator(
^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 358, in __init__
self.pynccl_comm = PyNcclCommunicator(
^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 113, in __init__
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 401, in ncclCommInitRank
self.NCCL_CHECK(
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 376, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

[2026-04-13 07:23:00] Received sigquit from a child process. It usually means the child failed.
[2026-04-13 07:23:00 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 3597, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 386, in __init__
self.init_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 630, in init_model_worker
self.init_tp_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 598, in init_tp_model_worker
self.tp_worker = TpModelWorker(**worker_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
self._init_model_runner()
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 402, in __init__
pre_model_load_memory = self.init_torch_distributed()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 956, in init_torch_distributed
initialize_model_parallel(
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1859, in initialize_model_parallel
_TP = init_model_parallel_group(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1480, in init_model_parallel_group
return GroupCoordinator(
^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 358, in __init__
self.pynccl_comm = PyNcclCommunicator(
^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 113, in __init__
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 401, in ncclCommInitRank
self.NCCL_CHECK(
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 376, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

[2026-04-13 07:23:00] Received sigquit from a child process. It usually means the child failed.

300+ tokens/s @atrix !!? i'm going to have to look into the sglang setup. i'm getting nowhere near that with vllm. i'm currently starting vllm with

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/data/models/MiniMax-M2.7-NVFP4 \
  --host 0.0.0.0 \
  --port 1235 \
  --served-model-name minimax-m2 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 147456 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think


It looks like you're missing --shm-size. Docker defaults to 64 MB of shared memory, which isn't enough; NCCL needs much more than that.

try adding:
--shm-size 32g
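Put together, that's just one extra flag on the original docker run (abridged sketch of the command from the post above, not a full recipe):

```shell
# Same invocation as before, with --shm-size added so NCCL can create
# its shared-memory segments (Docker's default /dev/shm is only 64 MB).
docker run --rm -it \
  --gpus all \
  --shm-size 32g \
  -v /home/admin/ai/lukealonso_MiniMax-M2.7-NVFP4:/model \
  -p 8000:8000 \
  voipmonitor/sglang:cu130 \
  python -m sglang.launch_server --model-path /model --tp 2 ...

# Inside a running container you can confirm the limit with:
#   df -h /dev/shm
```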

Alternatively, here's my docker-compose.yaml if you want the easy button, assuming you've installed Docker, the NVIDIA Container Toolkit, and all that. Note that I'm currently using fp8_e4m3 for the KV cache to fit a bit more context on my 2x RTX 6000s, but you can also use --kv-cache-dtype bf16 if you like that better or have better hardware.

docker-compose.yml:

services:
    minimax:
      image: voipmonitor/sglang:cu130
      container_name: minimax-m27
      shm_size: 32g
      ports:
        - "8001:5000"
      volumes:
        - ~/LLM:/models
      environment:
        - OMP_NUM_THREADS=16
        - SGLANG_ENABLE_SPEC_V2=True
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                device_ids: ["0", "1"]
                capabilities: [gpu]
      command: >
        python -m sglang.launch_server
        --model-path /models/minimax_m27_nvfp4
        --served-model-name MiniMax-M2.7
        --reasoning-parser minimax
        --tool-call-parser minimax-m2
        --tp 2
        --enable-torch-compile
        --trust-remote-code
        --kv-cache-dtype fp8_e4m3
        --quantization modelopt_fp4
        --moe-runner-backend b12x
        --fp4-gemm-backend b12x
        --attention-backend flashinfer
        --enable-pcie-oneshot-allreduce
        --mem-fraction-static 0.93
        --host 0.0.0.0 --port 5000

MiniMax 2.7 has analyzed its own docker logs. Over a few hours of coding-agent use (OpenCode, mostly Pi), I'm seeing 50-70 tokens/s generation. Seems a bit slower than 2.5 on the same hardware, with the same launch command.

vLLM Container Report (MiniMax-M2.7-NVFP4)

Report Date: 2026-04-13
Container: vllm


1. Infrastructure & Hardware

| Property | Value |
|---|---|
| GPU | 3× NVIDIA RTX PRO 6000 Blackwell Workstation Edition |
| GPU Memory (each) | 95,787 MiB (~98 GB) |
| Compute Capability | 12.0 (Blackwell arch) |
| CUDA Version | 12.9.1 |
| Total GPU Memory | ~294 GB across 3 GPUs |
| Container Memory (ShmSize) | 16 GB |
| Runtime | NVIDIA (nvidia-container-runtime) |

Note: Only GPUs 0 and 1 are used for inference (CUDA_VISIBLE_DEVICES=0,1). The third GPU appears to be unused or reserved.


2. Model Configuration

| Parameter | Value |
|---|---|
| Model | /mnt/data/models/MiniMax-M2.7-NVFP4 |
| Architecture | MiniMaxM2ForCausalLM |
| Checkpoint Size | 125.19 GiB |
| Checkpoint Format | NVFP4 (ModelOpt, experimental) |
| Max Sequence Length | 147,456 tokens |
| dtype | torch.bfloat16 |
| Quantization | modelopt_fp4 |
| Tokenizer | /mnt/data/models/MiniMax-M2.7-NVFP4 |
| Trust Remote Code | true |
| HuggingFace Cache | /mnt/data/nvme0n1/models/huggingface |

3. vLLM Server Configuration (from start-vllm)

#!/usr/bin/bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/data/models/MiniMax-M2.7-NVFP4 \
  --host 0.0.0.0 --port 1235 \
  --served-model-name minimax-m2 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 147456 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think

4. Key Configuration Parameters Explained

4.1 Parallelism

| Parameter | Value | Description |
|---|---|---|
| tensor-parallel-size | 2 | Model sharded across 2 GPUs via tensor parallelism |
| pipeline-parallel-size | 1 | Default |
| data-parallel-size | 1 | Default |
| VLLM_WORKER_MULTIPROC_METHOD | spawn | Process spawn method for workers |

4.2 Memory Management

| Parameter | Value | Description |
|---|---|---|
| gpu-memory-utilization | 0.95 | 95% of GPU memory allocated for KV cache and model |
| max-model-len | 147,456 | Extremely long context window (256K tokens max allowed) |
| max-num-batched-tokens | 16,384 | Max tokens processed in a single forward pass |
| max-num-seqs | 64 | Max number of concurrent sequences in a batch |
| disable-custom-all-reduce | true | Disables custom NCCL-based all-reduce (P2P disabled) |

4.3 Performance Optimizations

| Parameter | Description |
|---|---|
| enable-chunked-prefill=True | Splits large prefill requests into chunks to reduce memory pressure |
| enable-prefix-caching=True | Caches KV caches for repeated prompt prefixes |
| cudagraph_mode=FULL_AND_PIECEWISE | CUDA graph capturing for both full-sequence and piecewise (mixed prefill-decode) workloads |
| fuse_act_quant=True | Fuses activation quantization for FP4 GEMM |
| NVFP4 GEMM backend | CUTLASS (not MARLIN) |
| FlashAttention backend | FLASH_ATTN v2 |
| NVFP4 MoE backend | VLLM_CUTLASS |
| FlashInfer Autotune | Enabled (completed successfully) |

4.4 Tool & Reasoning Support

| Parameter | Value |
|---|---|
| enable-auto-tool-choice | true |
| tool-call-parser | minimax_m2 |
| reasoning-parser | minimax_m2_append_think |
| served-model-name | minimax-m2 |

5. Startup & Initialization Timeline

| Phase | Duration |
|---|---|
| Model loading (26 safetensor shards) | ~40 seconds |
| Torch compile (AOT) | ~72 seconds |
| Dynamo bytecode transform | ~12.5 seconds |
| Graph compilation (1-16384 range) | ~18 seconds |
| CUDA graph capturing (PIECEWISE: 19 graphs, FULL: 11 graphs) | ~8 seconds |
| Total initialization time | ~176 seconds |

CUDA Graph Memory

| Metric | Value |
|---|---|
| Estimated | 1.33 GiB per GPU |
| Actual | 1.20 GiB per GPU |
| Difference | 0.13 GiB (10.7% overestimation) |

KV Cache

| Metric | Value |
|---|---|
| Available KV cache memory | 23.08 GiB per GPU (46.16 GiB total across 2 GPUs) |
| GPU KV cache size | 195,136 tokens (total across both GPUs) |
| Maximum concurrency | 1.32x for 147,456-token requests |
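The 1.32x concurrency figure follows directly from the two numbers above; here is that arithmetic spelled out (values are taken from the report, nothing measured here):

```python
# KV-cache capacity vs. maximum request length, per the report above.
kv_cache_tokens = 195_136  # total GPU KV cache size across both GPUs
max_model_len = 147_456    # configured --max-model-len

# How many full-context requests fit in the KV cache at once.
concurrency = kv_cache_tokens / max_model_len
print(f"max concurrency at full context: {concurrency:.2f}x")  # 1.32x
```

Shorter requests raise effective concurrency proportionally, which is why decode batches of 5-7 requests fit comfortably in the logs above.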

6. Detailed Performance Metrics

6.1 Token Throughput Summary

Prompt (Prefill) Throughput

| Scenario | Tokens/Second |
|---|---|
| Idle / No requests | 0.0 |
| Low-load (light requests) | 7–50 |
| Medium-load (moderate batching) | 50–350 |
| High-load (batch-heavy prefill) | 350–1,500 |
| Peak observed (cache hit) | 6,183 |
| Peak observed (long prompt) | 5,967 |
| Very long prompt burst | 5,609 |

Generation (Decode) Throughput

| Scenario | Tokens/Second |
|---|---|
| Idle / No requests | 0.0 |
| Low-load (short response) | 4–15 |
| Medium-load (typical generation) | 15–40 |
| High-load (long streaming generation) | 40–58 |
| Peak observed | 75.1 |

6.2 Performance Over Time (Notable Snapshots)

| Timestamp | Prompt (tok/s) | Gen (tok/s) | Running | GPU KV % | Prefix Cache % |
|---|---|---|---|---|---|
| 08:18:30 | 110.5 | 20.8 | 0 | 0.0% | 94.8% |
| 08:19:50 | 13.9 | 4.2 | 1 | 16.3% | 94.8% |
| 08:34:30 | 5,967.0 | 7.1 | 0 | 0.0% | 94.7% |
| 08:40:50 | 59.0 | 36.6 | 1 | 32.7% | 94.9% |
| 08:50:30 | 0.0 | 75.1 | 1 | 8.2% | 95.4% |
| 08:51:50 | 14.3 | 5.3 | 1 | 25.6% | 95.3% |
| 08:52:10 | 63.0 | 42.4 | 1 | 26.0% | 95.3% |
| 08:53:40 | 191.4 | 51.1 | 1 | 19.0% | 95.3% |
| 12:08:51 | 0.0 | 56.5 | 1 | 29.2% | 95.0% |
| 12:09:01 | 0.0 | 56.0 | 1 | 29.5% | 95.0% |
| 12:09:11 | 0.0 | 55.6 | 1 | 29.7% | 95.0% |
| 12:09:21 | 0.0 | 55.1 | 1 | 30.0% | 95.0% |
| 12:12:21 | 0.0 | 55.3 | 1 | 30.8% | 95.0% |
| 12:12:31 | 0.0 | 54.8 | 1 | 31.1% | 95.0% |

6.3 Key Performance Observations

  1. Generation Throughput Stability: Under sustained single-request generation loads, the model consistently achieves 55–75 tokens/second.

  2. High Prefix Cache Hit Rate: The prefix cache hit rate stabilizes at ~95% after warmup, indicating excellent cache efficiency for repeated system prompts or conversation prefixes.

  3. GPU KV Cache Usage: During active requests, GPU KV cache usage reaches ~30–38%, well within the available 46 GiB total.

  4. Asynchronous Scheduling: Enabled, allowing better GPU utilization by overlapping prefill/decode operations.

  5. Chunked Prefill: With max_num_batched_tokens=16384, large prompts are chunked to avoid memory spikes.

  6. Peak Prompt Throughput: With prefix cache hits (94–95%), prompt processing bursts to 5,000–6,000+ tokens/second.

  7. SymmMemCommunicator Warning: Device capability 12.0 is not supported by SymmMemCommunicator — this is expected for Blackwell (compute 12.0) as noted in the warning.

  8. TensorFloat32 Warning: TensorFloat32 tensor cores available but not enabled. This could be enabled for faster FP32 matmul if accuracy is acceptable: torch.set_float32_matmul_precision('high').


7. Network & API

| Property | Value |
|---|---|
| API Host | 0.0.0.0:1235 |
| Exposed Port | 1235/tcp → Host 8080 |
| API Protocol | OpenAI-compatible REST |
| Endpoints | /v1/chat/completions, /v1/completions, /v1/models, /health, /metrics, etc. |
| Client IPs | 172.20.3.171, 172.20.3.167 |
| Default Sampling Params | temperature=1.0, top_k=40, top_p=0.95 (overridden from generation_config.json) |

8. Summary

| Metric | Value |
|---|---|
| Model | MiniMax-M2.7-NVFP4 (125 GB checkpoint, NVFP4 quantized) |
| Serving Stack | vLLM 0.19.1rc1 nightly |
| Hardware | 2× NVIDIA RTX PRO 6000 Blackwell (98 GB each) |
| Tensor Parallelism | 2 |
| Max Context | 147,456 tokens |
| KV Cache Capacity | 195,136 tokens across 2 GPUs |
| Peak Prefill Throughput | ~6,183 tokens/s (with prefix cache) |
| Sustained Generation Throughput | 55–75 tokens/s |
| Prefix Cache Hit Rate (warm) | ~95% |
| CUDA Graph | Full + Piecewise (19 piecewise + 11 full graphs) |
| Attention Backend | FlashAttention v2 |
| MoE GEMM Backend | CUTLASS |
| Startup Time | ~176 seconds |
| Tool Support | Auto tool choice via minimax_m2 parser |
| Reasoning Parser | minimax_m2_append_think |

appreciate the docker recipe @dareposte, i'm definitely going to try this!

@dareposte i've tried the recipe you provided with dual RTX PRO 6000. For me it was faster than vllm under every scenario, as far as I could tell. In some cases it seemed to cook over TWICE as fast. I've only used it for a couple hours, but barring any major issues, this will be my daily driver. SO. FREAKING. FAST!
