This is perfect! Thank you!
@lukealonso thank you! This is amazing!
My config below (managed to get 91 tps single query, and about 1000 tps at 64 concurrent requests, using 2x rtx 6000 pro blackwell with vllm 0.14.0rc1.dev171+gd63b96967) :
#!/bin/bash
# Activate vLLM venv
source /opt/vllm/bin/activate
# Set HuggingFace cache to use existing models
export HF_HOME=/opt/models/huggingface
# Enables CUTLASS kernels for MXFP4/MXFP8 quantized MoE layers via FlashInfer.
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
# Selects the FlashInfer MoE backend strategy. Valid options:
# - throughput: Optimized for high throughput (batch processing)
# - latency: Optimized for low latency (single requests) - default
# - masked_gemm: Uses masked GEMM operations
#
# throughput (CUTLASS)
# - Optimized for batch processing - maximizes tokens/second across many concurrent requests
# - Better when you have multiple users/requests in the queue
# - Uses grouped GEMM with prepare/finalize stages
# - Works on SM90 (Hopper) and all Blackwell variants (SM100, SM120)
#
# latency (TensorRT-LLM)
# - Optimized for single-request latency - minimizes time-to-first-token and inter-token latency
# - Better for interactive/real-time applications with few concurrent users
# - Direct kernel path (no prepare/finalize overhead)
# - SM100 family only (B100, B200 data center GPUs)
# - Requires gated MoE (SiLU activation) - falls back to CUTLASS for non-gated
export VLLM_FLASHINFER_MOE_BACKEND=throughput
# Enables FlashInfer's FP4 (4-bit floating point) MoE kernels.
export VLLM_USE_FLASHINFER_MOE_FP4=1
# Controls how vLLM spawns worker processes. Options:
# - fork: Default on Linux, faster but can cause CUDA issues
# - spawn: Safer with CUDA, required for multi-GPU setups on some systems
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Enables direct GPU tensor allocation when loading safetensors files, bypassing CPU→GPU copy.
# Still experimental but can provide ~2x faster model loading.
export SAFETENSORS_FAST_GPU=1
# NVIDIA system-level override enabling TF32 across all CUDA libraries (cuBLAS, cuDNN, TensorRT, etc.).
# TF32 uses a 10-bit mantissa (19 bits total) instead of FP32's 23-bit mantissa for internal matmul computations.
# Provides significant speedup with negligible accuracy loss for LLM inference.
# Note: Do NOT use TORCH_ALLOW_TF32_CUBLAS_OVERRIDE - it uses PyTorch's legacy API which conflicts
# with the new fp32_precision API used by torch.compile in PyTorch 2.9+.
export NVIDIA_TF32_OVERRIDE=1
# Selects the NCCL collective algorithm.
# Options: Tree, Ring, Collnet.
# Ring algorithm is typically better for PCIe-connected GPUs without NVLink (like the RTX 6000 Blackwell).
export NCCL_ALGO=Ring
# Selects the NCCL protocol. Options: LL (Low Latency), LL128, Simple.
# Simple protocol uses larger messages with less overhead, better for PCIe where latency is already high.
# LL/LL128 are optimized for NVLink's low-latency characteristics - not beneficial on PCIe.
export NCCL_PROTO=Simple
# Sets the minimum and maximum number of NCCL channels for parallel data transfer.
# More channels = more parallelism but also more memory overhead.
# For dual-GPU PCIe setups, 8-16 channels provides good throughput without excessive memory use.
# Default is 2-32, but on PCIe without NVLink, limiting to 16 max reduces memory pressure.
export NCCL_MIN_NCHANNELS=8
export NCCL_MAX_NCHANNELS=16
# Sets the NCCL buffer size per channel in bytes (16MB here).
# Larger buffers improve throughput by reducing synchronization overhead on high-latency PCIe links.
# Default is 4MB. We increase to 16MB because PCIe has higher latency than NVLink,
# so larger buffers help amortize that latency cost across more data per transfer.
export NCCL_BUFFSIZE=16777216
# Enables peer-to-peer direct GPU communication.
# RTX PRO 6000 Blackwell GPUs support P2P over PCIe (verified via torch.cuda.can_device_access_peer).
# P2P provides faster GPU-to-GPU transfers than the SHM fallback path (GPU → Host RAM → GPU).
export NCCL_P2P_DISABLE=0
# Disables InfiniBand transport probing.
# IB libraries are installed (libibverbs1) but no IB hardware exists on this system.
# This skips unnecessary device probing during NCCL initialization.
export NCCL_IB_DISABLE=1
# Disables NVLink SHARP/Scalable (NVLS) collective operations.
# NVLS requires NVLink connectivity between GPUs. This system uses PCIe + QPI/UPI (SYS topology).
# Explicitly disabling skips NVLS capability probing during initialization.
export NCCL_NVLS_ENABLE=0
# Enables shared memory transport (this is the default, explicitly set for documentation).
# With P2P enabled, SHM serves as fallback. Keeping enabled ensures reliability.
export NCCL_SHM_DISABLE=0
# Run command
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2.1-NVFP4 \
--host 0.0.0.0 \
--port 8353 \
--served-model-name minimax-m2.1 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 196608 \
--max-num-seqs 64 \
--max-num-batched-tokens 32768 \
--dtype auto \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--all2all-backend pplx \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--enable-expert-parallel \
--enable-prefix-caching \
--enable-chunked-prefill \
--attention-config.backend FLASHINFER \
--kv-cache-dtype fp8_e4m3 \
--calculate-kv-scales
# Model Card Info:
# We recommend using the following parameters for best performance: temperature=1.0, top_p = 0.95, top_k = 40.
#
# IMPORTANT:
# MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking
# content from the assistant's turns within the historical messages. In the model's output content, we use the
# <think>...</think> format to wrap the assistant's thinking content. When using the model, you must ensure that
# the historical content is passed back in its original format. Do not remove the <think>...</think> part,
# otherwise, the model's performance will be negatively affected.
#
# This vLLM option keeps the thinking content in the output - do not use the plain minimax_m2 reasoning parser
# unless your application passes the reasoning content back as required.
# --reasoning-parser minimax_m2_append_think
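To go with the model-card notes above, here is a minimal request sketch that sets the recommended sampling parameters and passes a previous assistant turn back with its <think>...</think> block intact. The payload contents are hypothetical; the model name and port match the serve command above.

```shell
# Build a chat request that (a) uses the model card's recommended sampling
# parameters and (b) keeps the <think>...</think> block from the previous
# assistant turn in the message history, as the model card requires.
PAYLOAD=$(cat <<'EOF'
{
  "model": "minimax-m2.1",
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<think>Simple arithmetic.</think>2+2 = 4."},
    {"role": "user", "content": "And times 3?"}
  ]
}
EOF
)
echo "$PAYLOAD"
# Against a live server started with the script above, send it with:
# curl -s http://localhost:8353/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

The key detail is the assistant message: its <think>...</think> content is returned verbatim in the history rather than stripped out.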
Awesome! Thanks for the settings, I'll try them out.
I was unable to run the model via vLLM using this config, but SGLang works fine at 69 tps (2x RTX 6000 Blackwell).
@Ja6ek my versions are these:
📊 Current versions
Python: 3.12.3
vLLM: 0.14.0rc1.dev171+gd63b96967
PyTorch: 2.9.1+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Triton: 3.5.1
FlashInfer: 0.5.3
If you post the error you get, I may be able to help.
My config :
Python: 3.12.12
vLLM: 0.14.0rc1.dev184+g715759610.cu130
PyTorch: 2.9.1+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Triton: 3.5.1
FlashInfer: 0.5.3
I have checked my SGLang docker script once again and see the option -e NCCL_P2P_DISABLE=1, but in the vLLM config it is set to 0. With the option set to zero, vLLM hangs forever with:
"servere_ai:600835:600921 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
servere_ai:600835:600921 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
servere_ai:600836:600922 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
servere_ai:600835:600921 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
(APIServer pid=600560) DEBUG 12-30 10:40:06 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=600560) DEBUG 12-30 10:40:16 [v1/engine/utils.py:950] Waiting for 1 local, 0 remote core engine."
So it looks like this is an NCCL issue, not a vLLM one.
Hm... the benefit of NCCL_P2P_DISABLE=0 is that it allows direct GPU-to-GPU communication. With P2P disabled, your GPUs have to talk in two steps via CPU memory (GPU → host RAM → GPU). Normally P2P support is auto-detected, so if you don't set the variable at all, NCCL should detect it on its own. Without P2P, my vLLM performance drops to about 80 tps.
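For anyone debugging this: the interconnect NCCL will use between the two GPUs can be read from `nvidia-smi topo -m`. Here is a sketch of what to look for; the sample output below is illustrative for a dual-GPU box without NVLink, not captured from either machine in this thread.

```shell
# On a real system, run: nvidia-smi topo -m
# Illustrative sample output for two PCIe-connected GPUs:
TOPO_SAMPLE='        GPU0    GPU1
GPU0     X      SYS
GPU1    SYS      X'
# Legend: NV# = NVLink, PIX/PXB = PCIe bridge(s), SYS = traffic crosses the
# CPU interconnect (QPI/UPI). Per the config comments above, these cards can
# still do P2P over PCIe even with a SYS topology.
link=$(awk 'NR > 1 && $1 == "GPU0" {print $3}' <<< "$TOPO_SAMPLE")
echo "GPU0 <-> GPU1 link: $link"
# You can also confirm P2P support directly from PyTorch:
# python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
```

If the matrix shows SYS but `can_device_access_peer` returns True, the hang is more likely an NCCL transport/config issue than missing hardware support.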