Thanks, thanks and more thanks. Many thanks.
I really appreciate you releasing this. I've been using the M2.5 NVFP4 and it is what I call stupid fast. It makes using any cloud service seem dog-slow by comparison, even the best and most expensive ones (except maybe Grok).
Anyway, this 2.7 seems just a bit slower. Yes, I know I should quantify that with some proof, but I thought I'd ask before interrupting work to run benchmarks. Is 2.7 noticeably slower for you?
I'm running a vllm nightly from Apr 12; GPU setup is 2x RTX PRO 6000 Blackwell.
what recipe are you using to launch?
QuantTrio/MiniMax-M2.5-AWQ - 111t/s
lukealonso/MiniMax-M2.7-NVFP4 - 91t/s
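If you want to quantify the difference without a full benchmark run, timing a single non-streaming request against the OpenAI-compatible endpoint is usually enough. A minimal sketch; the base URL and model name are placeholders for your own deployment:

```python
# Rough decode-throughput check against an OpenAI-compatible vLLM/sglang
# endpoint. base_url and model are placeholders -- adjust for your deployment.
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: reported completion tokens over wall-clock seconds."""
    return completion_tokens / elapsed_s


def measure(base_url: str, model: str, prompt: str, max_tokens: int = 512) -> float:
    """Time one non-streaming chat completion and return tokens/s."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_second(usage["completion_tokens"], time.monotonic() - t0)


# e.g. measure("http://localhost:8000", "minimax-m2", "Count to 200.")
print(tokens_per_second(910, 10.0))  # 91.0
```

The wall clock includes prefill, so keep the prompt short and max_tokens large; at a few hundred output tokens the error versus the server-reported decode throughput is small.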
recipe for both:
vllm serve \
    $modeldir \
    --served-model-name $modelname \
    --dtype auto \
    --max-num-seqs 16 \
    --max-model-len $maxmodellen \
    --gpu-memory-utilization 0.92 \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --enable-chunked-prefill \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
Man, sglang with the docker Luke recommended is the way to go right now.
Service logs (April 13, 2026, 2:13 AM, newest first):
[2026-04-13 06:13:59 TP0] Decode batch, #running-req: 5, #token: 18020, token usage: 0.09, cuda graph: True, gen throughput (token/s): 303.41, #queue-req: 0
[2026-04-13 06:13:59] INFO: 100.71.61.79:62814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-13 06:13:59 TP0] Decode batch, #running-req: 6, #token: 20077, token usage: 0.10, cuda graph: True, gen throughput (token/s): 312.46, #queue-req: 0
[2026-04-13 06:13:58 TP0] Decode batch, #running-req: 6, #token: 19837, token usage: 0.10, cuda graph: True, gen throughput (token/s): 312.24, #queue-req: 0
[2026-04-13 06:13:57 TP0] Decode batch, #running-req: 6, #token: 19597, token usage: 0.10, cuda graph: True, gen throughput (token/s): 316.57, #queue-req: 0
[2026-04-13 06:13:56 TP0] Decode batch, #running-req: 6, #token: 19357, token usage: 0.10, cuda graph: True, gen throughput (token/s): 346.67, #queue-req: 0
[2026-04-13 06:13:56] INFO: 100.71.61.79:62813 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-13 06:13:55 TP0] Decode batch, #running-req: 7, #token: 21028, token usage: 0.11, cuda graph: True, gen throughput (token/s): 348.91, #queue-req: 0
[2026-04-13 06:13:55 TP0] Decode batch, #running-req: 7, #token: 20748, token usage: 0.10, cuda graph: True, gen throughput (token/s): 347.85, #queue-req: 0
[2026-04-13 06:13:54 TP0] Decode batch, #running-req: 7, #token: 20468, token usage: 0.10, cuda graph: True, gen throughput (token/s): 350.01, #queue-req: 0
i'd love to get sglang working, but i can't get the docker image to work for me.
docker run --rm -it \
    --gpus all \
    -v /home/admin/ai/lukealonso_MiniMax-M2.7-NVFP4:/model \
    -e OMP_NUM_THREADS=16 \
    -e SGLANG_ENABLE_SPEC_V2=True \
    -p 8000:8000 \
    voipmonitor/sglang:cu130 \
    python -m sglang.launch_server \
    --model-path /model \
    --served-model-name lukealonso_MiniMax-M2.7-NVFP4 \
    --reasoning-parser minimax \
    --tool-call-parser minimax-m2 \
    --tp 2 \
    --enable-torch-compile \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --moe-runner-backend b12x \
    --fp4-gemm-backend b12x \
    --attention-backend flashinfer \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 \
    --port 8000
[2026-04-13 07:21:56 TP0] Init torch distributed begin.
[2026-04-13 07:21:56 TP1] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-04-13 07:21:56 TP1] Init torch distributed begin.
[2026-04-13 07:21:56] Fixing v5 tokenizer component mismatch for /model: pre_tokenizer ByteLevel -> Sequence, decoder ByteLevel -> ByteLevel
[1/2] /usr/local/cuda/bin/nvcc -MD -MF pcie_allreduce.cuda.o.d -DTORCH_EXTENSION_NAME=pcie_allreduce_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/venv/lib/python3.12/site-packages/torch/include -isystem /opt/venv/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_120a,code=sm_120a --compiler-options '-fPIC' -O2 --expt-relaxed-constexpr -std=c++17 -c /opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu -o pcie_allreduce.cuda.o
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu: In destructor ‘pcie_allreduce::PCIeAllreduce::~PCIeAllreduce()’:
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu:525:355: warning: ‘throw’ will always call ‘terminate’ [-Wterminate]
525 | for (auto [_, ptr] : ipc_handles) CHECK_CUDA_SUCCESS(cudaIpcCloseMemHandle(ptr));
| ^
/opt/sglang/python/sglang/srt/distributed/device_communicators/pcie_allreduce/pcie_allreduce.cu:525:355: note: in C++11 destructors default to ‘noexcept’
[2/2] c++ pcie_allreduce.cuda.o -shared -lcuda -L/opt/venv/lib/python3.12/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o pcie_allreduce_ext.so
[2026-04-13 07:23:00 TP0] sglang is using nccl==2.29.7
4fd6053ab362:113:113 [0] NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
4fd6053ab362:113:113 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO cudaDriverVersion 13000
4fd6053ab362:113:113 [0] NCCL INFO NCCL version 2.29.7+cuda13.2
4fd6053ab362:113:113 [0] NCCL INFO NCCL git version stable b81d6a5a3
4fd6053ab362:114:114 [1] NCCL INFO ENV/Plugin: Could not find: libnccl-env.so
4fd6053ab362:114:114 [1] NCCL INFO cudaDriverVersion 13000
4fd6053ab362:114:114 [1] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO NCCL version 2.29.7+cuda13.2
4fd6053ab362:114:114 [1] NCCL INFO NCCL git version stable b81d6a5a3
4fd6053ab362:114:114 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so
4fd6053ab362:114:114 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
4fd6053ab362:114:114 [1] NCCL INFO Failed to open libmlx5.so[.1]
4fd6053ab362:114:114 [1] NCCL INFO NET/IB : No device found.
4fd6053ab362:114:114 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO Failed to initialize NET plugin IB
4fd6053ab362:114:114 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4fd6053ab362:114:114 [1] NCCL INFO Initialized NET plugin Socket
4fd6053ab362:114:114 [1] NCCL INFO Assigned NET plugin Socket to comm
4fd6053ab362:114:114 [1] NCCL INFO GIN/Plugin: Could not find: libnccl-gin.so
4fd6053ab362:114:114 [1] NCCL INFO Failed to initialize any GIN plugin
4fd6053ab362:114:114 [1] NCCL INFO Using network Socket
4fd6053ab362:114:114 [1] NCCL INFO [Rank 1] ncclCommInitRank comm 0x38c31510 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0x504a6b25b462efc3 - Init START
4fd6053ab362:113:113 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so
4fd6053ab362:113:113 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO Failed to open libmlx5.so[.1]
4fd6053ab362:113:113 [0] NCCL INFO NET/IB : No device found.
4fd6053ab362:113:113 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO Failed to initialize NET plugin IB
4fd6053ab362:113:113 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
4fd6053ab362:113:113 [0] NCCL INFO Initialized NET plugin Socket
4fd6053ab362:113:113 [0] NCCL INFO Assigned NET plugin Socket to comm
4fd6053ab362:113:113 [0] NCCL INFO GIN/Plugin: Could not find: libnccl-gin.so
4fd6053ab362:113:113 [0] NCCL INFO Failed to initialize any GIN plugin
4fd6053ab362:113:113 [0] NCCL INFO Using network Socket
4fd6053ab362:113:113 [0] NCCL INFO [Rank 0] ncclCommInitRank comm 0x37f04f40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 46000 commId 0x504a6b25b462efc3 - Init START
4fd6053ab362:113:113 [0] NCCL INFO RAS client listening socket at ::1<28028>
4fd6053ab362:114:114 [1] NCCL INFO RAS client listening socket at ::1<28028>
4fd6053ab362:114:114 [1] NCCL INFO Bootstrap timings total 0.003376 (create 0.000041, send 0.000142, recv 0.002518, ring 0.0000447, delay 0.000002)
4fd6053ab362:113:113 [0] NCCL INFO Bootstrap timings total 0.001124 (create 0.000040, send 0.000131, recv 0.000444, ring 0.000041, delay 0.000001)
4fd6053ab362:114:114 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO ncclTopoGetCpuAffinity: Affinity for GPU 0 is empty, ignoring. (GPU affinity = ; CPU affinity = 0-47).
4fd6053ab362:113:113 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
4fd6053ab362:114:114 [1] NCCL INFO ncclTopoGetCpuAffinity: Affinity for GPU 1 is empty, ignoring. (GPU affinity = ; CPU affinity = 0-47).
4fd6053ab362:114:114 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0.
4fd6053ab362:113:113 [0] NCCL INFO comm 0x37f04f40 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
4fd6053ab362:114:114 [1] NCCL INFO comm 0x38c31510 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
4fd6053ab362:113:113 [0] NCCL INFO Channel 00/02 : 0 1
4fd6053ab362:113:113 [0] NCCL INFO Channel 01/02 : 0 1
4fd6053ab362:114:114 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
4fd6053ab362:113:113 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
4fd6053ab362:114:114 [1] NCCL INFO P2P Chunksize set to 131072
4fd6053ab362:113:113 [0] NCCL INFO P2P Chunksize set to 131072
4fd6053ab362:113:113 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so
4fd6053ab362:114:114 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so
4fd6053ab362:113:113 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 isAllCudaP2p 1
4fd6053ab362:114:114 [1] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0 isAllCudaP2p 1
4fd6053ab362:114:387 [0] NCCL INFO [Proxy Service] Device 1 CPU core 21
4fd6053ab362:114:389 [0] NCCL INFO [Proxy Service UDS] Device 1 CPU core 30
4fd6053ab362:113:388 [0] NCCL INFO [Proxy Service] Device 0 CPU core 10
4fd6053ab362:113:390 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 29
4fd6053ab362:114:114 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
4fd6053ab362:114:114 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
4fd6053ab362:113:113 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
4fd6053ab362:114:114 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
4fd6053ab362:113:113 [0] NCCL INFO Connected all trees
4fd6053ab362:114:114 [1] NCCL INFO Connected all trees
[2026-04-13 07:23:00] 4fd6053ab362:114:387 [1] misc/shmutils.cc:88 NCCL WARN Error: failed to extend /dev/shm/nccl-QDYemn to 34210180 bytes, error: No space left on device (28)
[2026-04-13 07:23:00] 4fd6053ab362:114:387 [1] misc/shmutils.cc:133 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-QDYemn (size 34210176), error: No space left on device (28)
4fd6053ab362:114:387 [1] NCCL INFO proxy.cc:1393 -> 2
4fd6053ab362:114:387 [1] NCCL INFO proxy.cc:1451 -> 2
4fd6053ab362:114:114 [1] NCCL INFO proxy.cc:1166 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:1400 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:1707 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:2225 -> 2
4fd6053ab362:114:114 [1] NCCL INFO init.cc:2252 -> 2
[2026-04-13 07:23:00] 4fd6053ab362:113:388 [0] misc/shmutils.cc:88 NCCL WARN Error: failed to extend /dev/shm/nccl-JSqlX5 to 34210180 bytes, error: No space left on device (28)
[2026-04-13 07:23:00] 4fd6053ab362:113:388 [0] misc/shmutils.cc:133 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-JSqlX5 (size 34210176), error: No space left on device (28)
4fd6053ab362:113:388 [0] NCCL INFO proxy.cc:1393 -> 2
4fd6053ab362:113:388 [0] NCCL INFO proxy.cc:1451 -> 2
4fd6053ab362:113:113 [0] NCCL INFO proxy.cc:1166 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:1400 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:1707 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:2225 -> 2
4fd6053ab362:113:113 [0] NCCL INFO init.cc:2252 -> 2
[2026-04-13 07:23:00 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 3597, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 386, in __init__
self.init_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 630, in init_model_worker
self.init_tp_model_worker()
File "/opt/sglang/python/sglang/srt/managers/scheduler.py", line 598, in init_tp_model_worker
self.tp_worker = TpModelWorker(**worker_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 261, in __init__
self._init_model_runner()
File "/opt/sglang/python/sglang/srt/managers/tp_worker.py", line 344, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 402, in __init__
pre_model_load_memory = self.init_torch_distributed()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/model_executor/model_runner.py", line 956, in init_torch_distributed
initialize_model_parallel(
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1859, in initialize_model_parallel
_TP = init_model_parallel_group(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 1480, in init_model_parallel_group
return GroupCoordinator(
^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/parallel_state.py", line 358, in __init__
self.pynccl_comm = PyNcclCommunicator(
^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 113, in __init__
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 401, in ncclCommInitRank
self.NCCL_CHECK(
File "/opt/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 376, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
[2026-04-13 07:23:00] Received sigquit from a child process. It usually means the child failed.
[2026-04-13 07:23:00 TP0] Scheduler hit an exception: Traceback (most recent call last):
(identical traceback to TP1 above)
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
[2026-04-13 07:23:00] Received sigquit from a child process. It usually means the child failed.
300+ tokens/s @atrix !!? i'm going to have to look into the sglang setup. i'm getting nowhere near that with vllm. i'm currently starting vllm with
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/data/models/MiniMax-M2.7-NVFP4 \
--host 0.0.0.0 \
--port 1235 \
--served-model-name minimax-m2 \
--trust-remote-code \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 147456 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
> i'd love to get sglang working, but i can't get the docker image to work for me.
It looks like you're missing --shm-size; docker defaults to 64 MB of shared memory, which isn't enough. NCCL needs much more than that.
try adding:
--shm-size 32g
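The numbers line up with the failure in the log: NCCL tried to grow each /dev/shm proxy segment to ~32.6 MiB, and with two TP ranks that alone exceeds Docker's 64 MiB default. A quick sanity check of the arithmetic, using the sizes from the log above:

```python
# Sizes from the NCCL warning above; Docker's default --shm-size is 64 MiB.
segment = 34_210_180          # bytes NCCL failed to extend /dev/shm/nccl-* to
default_shm = 64 * 1024**2    # Docker's default shared-memory size

print(round(segment / 2**20, 1))   # per-rank segment in MiB: 32.6
# Two TP ranks each need at least one such segment, so the default can't hold them:
print(2 * segment > default_shm)   # True
```

Hence the "No space left on device" on /dev/shm even with plenty of disk and RAM free; 32g is generous but shared memory is only consumed as mapped.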
Alternatively, here's my docker-compose.yml if you want the easy button (assuming you've installed Docker, the NVIDIA Container Toolkit, all that). Note that I'm using fp8_e4m3 for the KV cache right now, testing to fit a bit more context on my 2x RTX 6000s, but you can also use --kv-cache-dtype bf16 if you prefer it or have better hardware.
docker-compose.yml:
services:
  minimax:
    image: voipmonitor/sglang:cu130
    container_name: minimax-m27
    shm_size: 32g
    ports:
      - "8001:5000"
    volumes:
      - ~/LLM:/models
    environment:
      - OMP_NUM_THREADS=16
      - SGLANG_ENABLE_SPEC_V2=True
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    command: >
      python -m sglang.launch_server
      --model-path /models/minimax_m27_nvfp4
      --served-model-name MiniMax-M2.7
      --reasoning-parser minimax
      --tool-call-parser minimax-m2
      --tp 2
      --enable-torch-compile
      --trust-remote-code
      --kv-cache-dtype fp8_e4m3
      --quantization modelopt_fp4
      --moe-runner-backend b12x
      --fp4-gemm-backend b12x
      --attention-backend flashinfer
      --enable-pcie-oneshot-allreduce
      --mem-fraction-static 0.93
      --host 0.0.0.0 --port 5000
MiniMax 2.7 has analyzed its own docker logs. Over a few hours of coding-agent use (OpenCode, mostly Pi), I'm seeing 50-70 tokens/s generation. Seems a bit slower than 2.5 on the same hardware, same launch command.
vLLM Container Report (MiniMax-M2.7-NVFP4)
Report Date: 2026-04-13
Container: vllm
1. Infrastructure & Hardware
| Property | Value |
|---|---|
| GPU | 3× NVIDIA RTX PRO 6000 Blackwell Workstation Edition |
| GPU Memory (each) | 95,787 MiB (~98 GB) |
| Compute Capability | 12.0 (Blackwell arch) |
| CUDA Version | 12.9.1 |
| Total GPU Memory | ~294 GB across 3 GPUs |
| Container Memory (ShmSize) | 16 GB |
| Runtime | NVIDIA (nvidia-container-runtime) |
Note: Only GPUs 0 and 1 are used for inference (CUDA_VISIBLE_DEVICES=0,1). The third GPU appears to be unused or reserved.
2. Model Configuration
| Parameter | Value |
|---|---|
| Model | /mnt/data/models/MiniMax-M2.7-NVFP4 |
| Architecture | MiniMaxM2ForCausalLM |
| Checkpoint Size | 125.19 GiB |
| Checkpoint Format | NVFP4 (ModelOpt, experimental) |
| Max Sequence Length | 147,456 tokens |
| dtype | torch.bfloat16 |
| Quantization | modelopt_fp4 |
| Tokenizer | /mnt/data/models/MiniMax-M2.7-NVFP4 |
| Trust Remote Code | true |
| HuggingFace Cache | /mnt/data/nvme0n1/models/huggingface |
3. vLLM Server Configuration (from start-vllm)
#!/usr/bin/bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/data/models/MiniMax-M2.7-NVFP4 \
--host 0.0.0.0 --port 1235 \
--served-model-name minimax-m2 \
--trust-remote-code \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 147456 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
4. Key Configuration Parameters Explained
4.1 Parallelism
| Parameter | Value | Description |
|---|---|---|
| tensor-parallel-size | 2 | Model sharded across 2 GPUs via tensor parallelism |
| pipeline-parallel-size | 1 | Default |
| data-parallel-size | 1 | Default |
| VLLM_WORKER_MULTIPROC_METHOD | spawn | Process spawn method for workers |
4.2 Memory Management
| Parameter | Value | Description |
|---|---|---|
| gpu-memory-utilization | 0.95 | 95% of GPU memory allocated for model weights and KV cache |
| max-model-len | 147,456 | Long context window (256K tokens max allowed) |
| max-num-batched-tokens | 16,384 | Max tokens processed in a single forward pass |
| max-num-seqs | 64 | Max number of concurrent sequences in a batch |
| disable-custom-all-reduce | true | Disables vLLM's custom all-reduce kernel, falling back to NCCL (P2P is disabled) |
4.3 Performance Optimizations
| Parameter | Description |
|---|---|
| enable-chunked-prefill=True | Splits large prefill requests into chunks to reduce memory pressure |
| enable-prefix-caching=True | Caches KV caches for repeated prompt prefixes |
| cudagraph_mode=FULL_AND_PIECEWISE | CUDA graph capturing for both full-sequence and piecewise (mixed prefill-decode) workloads |
| fuse_act_quant=True | Fuses activation quantization for FP4 GEMM |
| NVFP4 GEMM backend | CUTLASS (not MARLIN) |
| FlashAttention backend | FLASH_ATTN v2 |
| NVFP4 MoE backend | VLLM_CUTLASS |
| FlashInfer autotune | Enabled (completed successfully) |
4.4 Tool & Reasoning Support
| Parameter | Value |
|---|---|
| enable-auto-tool-choice | true |
| tool-call-parser | minimax_m2 |
| reasoning-parser | minimax_m2_append_think |
| served-model-name | minimax-m2 |
5. Startup & Initialization Timeline
| Phase | Duration |
|---|---|
| Model loading (26 safetensor shards) | ~40 seconds |
| Torch compile (AOT) | ~72 seconds |
| Dynamo bytecode transform | ~12.5 seconds |
| Graph compilation (1-16384 range) | ~18 seconds |
| CUDA graph capturing (PIECEWISE: 19 graphs, FULL: 11 graphs) | ~8 seconds |
| Total initialization time | ~176 seconds |
CUDA Graph Memory
| Metric | Value |
|---|---|
| Estimated | 1.33 GiB per GPU |
| Actual | 1.20 GiB per GPU |
| Difference | 0.13 GiB (10.7% overestimation) |
KV Cache
| Metric | Value |
|---|---|
| Available KV cache memory | 23.08 GiB per GPU (46.16 GiB total across 2 GPUs) |
| GPU KV cache size | 195,136 tokens (total across both GPUs) |
| Maximum concurrency | 1.32x for 147,456-token requests |
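The concurrency figure follows directly from the cache capacity in the same table; a quick check of the arithmetic:

```python
# KV cache capacity vs. configured max context, from the table above.
kv_cache_tokens = 195_136     # GPU KV cache size, total across both GPUs
max_model_len = 147_456       # configured --max-model-len

concurrency = kv_cache_tokens / max_model_len
print(f"{concurrency:.2f}x")  # 1.32x max-length requests fit in cache at once
```

In practice requests rarely use the full context, so real concurrency is much higher; this number is the worst case.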
6. Detailed Performance Metrics
6.1 Token Throughput Summary
Prompt (Prefill) Throughput
| Scenario | Tokens/Second |
|---|---|
| Idle / No requests | 0.0 |
| Low-load (light requests) | 7–50 |
| Medium-load (moderate batching) | 50–350 |
| High-load (batch-heavy prefill) | 350–1,500 |
| Peak observed (cache hit) | 6,183 tokens/s |
| Peak observed (long prompt) | 5,967 tokens/s |
| Very long prompt burst | 5,609 tokens/s |
Generation (Decode) Throughput
| Scenario | Tokens/Second |
|---|---|
| Idle / No requests | 0.0 |
| Low-load (short response) | 4–15 |
| Medium-load (typical generation) | 15–40 |
| High-load (long streaming generation) | 40–58 |
| Peak observed | 75.1 tokens/s |
6.2 Performance Over Time (Notable Snapshots)
| Timestamp | Prompt (tok/s) | Gen (tok/s) | Running | GPU KV % | Prefix Cache % |
|---|---|---|---|---|---|
| 08:18:30 | 110.5 | 20.8 | 0 | 0.0% | 94.8% |
| 08:19:50 | 13.9 | 4.2 | 1 | 16.3% | 94.8% |
| 08:34:30 | 5,967.0 | 7.1 | 0 | 0.0% | 94.7% |
| 08:40:50 | 59.0 | 36.6 | 1 | 32.7% | 94.9% |
| 08:50:30 | 0.0 | 75.1 | 1 | 8.2% | 95.4% |
| 08:51:50 | 14.3 | 5.3 | 1 | 25.6% | 95.3% |
| 08:52:10 | 63.0 | 42.4 | 1 | 26.0% | 95.3% |
| 08:53:40 | 191.4 | 51.1 | 1 | 19.0% | 95.3% |
| 12:08:51 | 0.0 | 56.5 | 1 | 29.2% | 95.0% |
| 12:09:01 | 0.0 | 56.0 | 1 | 29.5% | 95.0% |
| 12:09:11 | 0.0 | 55.6 | 1 | 29.7% | 95.0% |
| 12:09:21 | 0.0 | 55.1 | 1 | 30.0% | 95.0% |
| 12:12:21 | 0.0 | 55.3 | 1 | 30.8% | 95.0% |
| 12:12:31 | 0.0 | 54.8 | 1 | 31.1% | 95.0% |
6.3 Key Performance Observations
Generation Throughput Stability: Under sustained single-request generation loads, the model consistently achieves 55–75 tokens/second.
High Prefix Cache Hit Rate: The prefix cache hit rate stabilizes at ~95% after warmup, indicating excellent cache efficiency for repeated system prompts or conversation prefixes.
GPU KV Cache Usage: During active requests, GPU KV cache usage reaches ~30–38%, well within the available 46 GiB total.
Asynchronous Scheduling: Enabled, allowing better GPU utilization by overlapping prefill/decode operations.
Chunked Prefill: With max_num_batched_tokens=16384, large prompts are chunked to avoid memory spikes.
Peak Prompt Throughput: With prefix cache hits (94–95%), prompt processing bursts to 5,000–6,000+ tokens/second.
SymmMemCommunicator Warning: Device capability 12.0 is not supported by SymmMemCommunicator — this is expected for Blackwell (compute 12.0) as noted in the warning.
TensorFloat32 Warning: TensorFloat32 tensor cores are available but not enabled. This could be enabled for faster FP32 matmul if the accuracy tradeoff is acceptable: torch.set_float32_matmul_precision('high').
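The chunked-prefill observation above is easy to make concrete: the number of prefill forward passes for a long prompt is just the prompt length divided by the chunk budget, rounded up. A sketch with this report's configured values:

```python
import math

# Chunked prefill: a long prompt is processed in fixed-size token chunks
# rather than one giant forward pass.
max_num_batched_tokens = 16_384   # chunk budget per forward pass
prompt_len = 147_456              # the configured max-model-len

chunks = math.ceil(prompt_len / max_num_batched_tokens)
print(chunks)  # 9 -- a max-length prompt takes exactly 9 full-size passes
```

This is why prefill throughput scales with batch pressure while peak KV memory stays bounded by the chunk budget.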
7. Network & API
| Property | Value |
|---|---|
| API Host | 0.0.0.0:1235 |
| Exposed Port | 1235/tcp → Host 8080 |
| API Protocol | OpenAI-compatible REST |
| Endpoints | /v1/chat/completions, /v1/completions, /v1/models, /health, /metrics, etc. |
| Client IPs | 172.20.3.171, 172.20.3.167 |
| Default Sampling Params | temperature=1.0, top_k=40, top_p=0.95 (overridden from generation_config.json) |
8. Summary
| Metric | Value |
|---|---|
| Model | MiniMax-M2.7-NVFP4 (125 GB checkpoint, NVFP4 quantized) |
| Serving Stack | vLLM 0.19.1rc1 nightly |
| Hardware | 2× NVIDIA RTX PRO 6000 Blackwell (98 GB each) |
| Tensor Parallelism | 2 |
| Max Context | 147,456 tokens |
| KV Cache Capacity | 195,136 tokens across 2 GPUs |
| Peak Prefill Throughput | ~6,183 tokens/s (with prefix cache) |
| Sustained Generation Throughput | 55–75 tokens/s |
| Prefix Cache Hit Rate (warm) | ~95% |
| CUDA Graph | Full + Piecewise (19 piecewise + 11 full graphs) |
| Attention Backend | FlashAttention v2 |
| MoE GEMM Backend | CUTLASS |
| Startup Time | ~176 seconds |
| Tool Support | Auto tool choice via minimax_m2 parser |
| Reasoning Parser | minimax_m2_append_think |
@dareposte i've tried the recipe you provided with dual RTX PRO 6000. For me it was faster than vllm under every scenario, as far as I could tell. In some cases it seemed to cook over TWICE as fast. I've only used it for a couple hours, but barring any major issues, this will be my daily driver. SO. FREAKING. FAST!