Working configuration for Nvidia Blackwell
#4
by luismiguelsaez - opened
Hi folks!
The working vLLM configuration posted by the author doesn't work on dual RTX 6000 Pro cards, so I'm leaving here what worked for me:
CUDA_VISIBLE_DEVICES=0,1 \
SAFETENSORS_FAST_GPU=1 \
NCCL_P2P_DISABLE=1 \
NCCL_DEBUG=INFO \
VLLM_LOGGING_LEVEL=INFO \
vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
--trust-remote-code \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--disable-custom-all-reduce \
--kv-cache-dtype fp8 \
--max-num-seqs 2
Hope it's useful for someone!
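In case it helps anyone checking that the server actually came up: a quick smoke test against vLLM's OpenAI-compatible endpoint. This assumes the default port 8000; adjust the host/port if you pass `--port` or `--host` to `vllm serve`.

```shell
# Smoke-test the OpenAI-compatible chat endpoint
# (default port 8000; change if you pass --port to vllm serve)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "lukealonso/MiniMax-M2.7-NVFP4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```

If the server hung during initialization (as described below), this request will simply time out rather than return a completion.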
Looks similar to the vLLM config I settled on. I'm also running 2x RTX PRO 6000 Blackwell. I found the performance to be slightly slower than 2.5 with a similar setup on the same hardware. See my thread posted here yesterday. Another user posted a nice SGLang docker-compose that is BLAZING FAST.
Thanks, I'll have a look at the SGLang compose YAML. Regarding the configuration I used: I couldn't make it work without --disable-custom-all-reduce and the NCCL variables, because it got stuck during initialization otherwise.
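For anyone hitting the same hang: NCCL_P2P_DISABLE=1 makes NCCL fall back to shared-memory/host transfers instead of direct peer-to-peer GPU copies, which is a common workaround when init stalls on multi-GPU workstation boards. A quick way to see what interconnect path exists between the two cards (my reading of the symptoms, not an official diagnosis):

```shell
# Print the GPU interconnect topology as the driver sees it.
# The matrix entries (NV#, PIX, PXB, SYS, ...) show the P2P path
# between each GPU pair; SYS-only links are where P2P often misbehaves.
nvidia-smi topo -m
```

With NCCL_DEBUG=INFO already set in the command above, the NCCL init logs will also print which transport (P2P, SHM, NET) was actually selected, so you can confirm the fallback took effect.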