This is perfect! Thank you!
@lukealonso thank you! This is amazing!
My config below (managed to get 91 tps single query, and about 1000 tps at 64 concurrent requests, using 2x rtx 6000 pro blackwell with vllm 0.14.0rc1.dev171+gd63b96967) :
#!/bin/bash
# Activate vLLM venv
source /opt/vllm/bin/activate
# Set HuggingFace cache to use existing models
export HF_HOME=/opt/models/huggingface
# Enables CUTLASS kernels for MXFP4/MXFP8 quantized MoE layers via FlashInfer.
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
# Selects the FlashInfer MoE backend strategy. Valid options:
# - throughput: Optimized for high throughput (batch processing)
# - latency: Optimized for low latency (single requests) - default
# - masked_gemm: Uses masked GEMM operations
#
# throughput (CUTLASS)
# - Optimized for batch processing - maximizes tokens/second across many concurrent requests
# - Better when you have multiple users/requests in the queue
# - Uses grouped GEMM with prepare/finalize stages
# - Works on SM90 (Hopper) and all Blackwell variants (SM100, SM120)
#
# latency (TensorRT-LLM)
# - Optimized for single-request latency - minimizes time-to-first-token and inter-token latency
# - Better for interactive/real-time applications with few concurrent users
# - Direct kernel path (no prepare/finalize overhead)
# - SM100 family only (B100, B200 data center GPUs)
# - Requires gated MoE (SiLU activation) - falls back to CUTLASS for non-gated
export VLLM_FLASHINFER_MOE_BACKEND=throughput
# Enables FlashInfer's FP4 (4-bit floating point) MoE kernels.
export VLLM_USE_FLASHINFER_MOE_FP4=1
# Controls how vLLM spawns worker processes. Options:
# - fork: Default on Linux, faster but can cause CUDA issues
# - spawn: Safer with CUDA, required for multi-GPU setups on some systems
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Enables direct GPU tensor allocation when loading safetensors files, bypassing CPU→GPU copy.
# Still experimental but can provide ~2x faster model loading.
export SAFETENSORS_FAST_GPU=1
# NVIDIA system-level override enabling TF32 across all CUDA libraries (cuBLAS, cuDNN, TensorRT, etc.).
# TF32 uses a 10-bit mantissa (19 bits total) instead of FP32's 23-bit mantissa for internal matmul computations.
# Provides significant speedup with negligible accuracy loss for LLM inference.
# Note: Do NOT use TORCH_ALLOW_TF32_CUBLAS_OVERRIDE - it uses PyTorch's legacy API which conflicts
# with the new fp32_precision API used by torch.compile in PyTorch 2.9+.
export NVIDIA_TF32_OVERRIDE=1
# Selects the NCCL collective algorithm.
# Options: Tree, Ring, Collnet.
# Ring algorithm is typically better for PCIe-connected GPUs without NVLink (like the RTX 6000 Blackwell).
export NCCL_ALGO=Ring
# Selects the NCCL protocol. Options: LL (Low Latency), LL128, Simple.
# Simple protocol uses larger messages with less overhead, better for PCIe where latency is already high.
# LL/LL128 are optimized for NVLink's low-latency characteristics - not beneficial on PCIe.
export NCCL_PROTO=Simple
# Sets the minimum and maximum number of NCCL channels for parallel data transfer.
# More channels = more parallelism but also more memory overhead.
# For dual-GPU PCIe setups, 8-16 channels provides good throughput without excessive memory use.
# Default is 2-32, but on PCIe without NVLink, limiting to 16 max reduces memory pressure.
export NCCL_MIN_NCHANNELS=8
export NCCL_MAX_NCHANNELS=16
# Sets the NCCL buffer size per channel in bytes (16MB here).
# Larger buffers improve throughput by reducing synchronization overhead on high-latency PCIe links.
# Default is 4MB. We increase to 16MB because PCIe has higher latency than NVLink,
# so larger buffers help amortize that latency cost across more data per transfer.
export NCCL_BUFFSIZE=16777216
# Enables peer-to-peer direct GPU communication.
# RTX PRO 6000 Blackwell GPUs support P2P over PCIe (verified via torch.cuda.can_device_access_peer).
# P2P provides faster GPU-to-GPU transfers than the SHM fallback path (GPU → Host RAM → GPU).
export NCCL_P2P_DISABLE=0
# Disables InfiniBand transport probing.
# IB libraries are installed (libibverbs1) but no IB hardware exists on this system.
# This skips unnecessary device probing during NCCL initialization.
export NCCL_IB_DISABLE=1
# Disables NVLink SHARP/Scalable (NVLS) collective operations.
# NVLS requires NVLink connectivity between GPUs. This system uses PCIe + QPI/UPI (SYS topology).
# Explicitly disabling skips NVLS capability probing during initialization.
export NCCL_NVLS_ENABLE=0
# Enables shared memory transport (this is the default, explicitly set for documentation).
# With P2P enabled, SHM serves as fallback. Keeping enabled ensures reliability.
export NCCL_SHM_DISABLE=0
# Run command
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2.1-NVFP4 \
--host 0.0.0.0 \
--port 8353 \
--served-model-name minimax-m2.1 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 196608 \
--max-num-seqs 64 \
--max-num-batched-tokens 32768 \
--dtype auto \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--all2all-backend pplx \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--enable-expert-parallel \
--enable-prefix-caching \
--enable-chunked-prefill \
--attention-config.backend FLASHINFER \
--kv-cache-dtype fp8_e4m3 \
--calculate-kv-scales
# Model Card Info:
# We recommend using the following parameters for best performance: temperature=1.0, top_p = 0.95, top_k = 40.
#
# IMPORTANT:
# MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking
# content from the assistant's turns within the historical messages. In the model's output content, we use the
# <think>...</think> format to wrap the assistant's thinking content. When using the model, you must ensure that
# the historical content is passed back in its original format. Do not remove the <think>...</think> part,
# otherwise, the model's performance will be negatively affected.
#
# This vLLM option keeps the thinking content in the output - do not use the plain minimax_m2 reasoning parser
# unless your application passes the reasoning content back as required.
# --reasoning-parser minimax_m2_append_think
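To go with the model-card notes above, here is a minimal request sketch that sets the recommended sampling parameters and passes a previous assistant turn back with its <think>...</think> block intact. The payload contents are hypothetical; the model name and port match the serve command above.

```shell
# Build a chat request that (a) uses the model card's recommended sampling
# parameters and (b) keeps the <think>...</think> block from the previous
# assistant turn in the message history, as the model card requires.
PAYLOAD=$(cat <<'EOF'
{
  "model": "minimax-m2.1",
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<think>Simple arithmetic.</think>2+2 = 4."},
    {"role": "user", "content": "And times 3?"}
  ]
}
EOF
)
echo "$PAYLOAD"
# Against a live server started with the script above, send it with:
# curl -s http://localhost:8353/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

The key detail is the assistant message: its <think>...</think> content is returned verbatim in the history rather than stripped out.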
Awesome! Thanks for the settings, I'll try them out.
I was unable to run the model via vLLM using this config, but SGLang works fine at 69 tps (2x RTX 6000 Blackwell).
@Ja6ek my versions are these:
📊 Current versions
Python: 3.12.3
vLLM: 0.14.0rc1.dev171+gd63b96967
PyTorch: 2.9.1+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Triton: 3.5.1
FlashInfer: 0.5.3
If you post the error you get, I may be able to help.
My config :
Python: 3.12.12
vLLM: 0.14.0rc1.dev184+g715759610.cu130
PyTorch: 2.9.1+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Triton: 3.5.1
FlashInfer: 0.5.3
I have checked my SGLang docker script once again and see the option -e NCCL_P2P_DISABLE=1, but in the vLLM config it is set to 0. With the option set to zero, vLLM hangs forever with:
"servere_ai:600835:600921 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
servere_ai:600835:600921 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
servere_ai:600836:600922 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
servere_ai:600835:600921 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
(APIServer pid=600560) DEBUG 12-30 10:40:06 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=600560) DEBUG 12-30 10:40:16 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine."
So it looks like this is an NCCL issue, not a vLLM one.
Hm... the benefit of NCCL_P2P_DISABLE=0 is that it allows direct GPU-to-GPU communication. With P2P disabled, your GPUs have to talk in two steps via CPU memory (GPU → host RAM → GPU). Normally P2P support is auto-detected, so if you don't set the variable at all, NCCL should detect it on its own. Without P2P, my vLLM performance drops to about 80 tps.
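For anyone debugging this: the interconnect NCCL will use between the two GPUs can be read from `nvidia-smi topo -m`. Here is a sketch of what to look for; the sample output below is illustrative for a dual-GPU box without NVLink, not captured from either machine in this thread.

```shell
# On a real system, run: nvidia-smi topo -m
# Illustrative sample output for two PCIe-connected GPUs:
TOPO_SAMPLE='        GPU0    GPU1
GPU0     X      SYS
GPU1    SYS      X'
# Legend: NV# = NVLink, PIX/PXB = PCIe bridge(s), SYS = traffic crosses the
# CPU interconnect (QPI/UPI). Per the config comments above, these cards can
# still do P2P over PCIe even with a SYS topology.
link=$(awk 'NR > 1 && $1 == "GPU0" {print $3}' <<< "$TOPO_SAMPLE")
echo "GPU0 <-> GPU1 link: $link"
# You can also confirm P2P support directly from PyTorch:
# python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
```

If the matrix shows SYS but `can_device_access_peer` returns True, the hang is more likely an NCCL transport/config issue than missing hardware support.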