Current Nightly Build is Unavailable + Other VLLM issues. Working docker config as of 1/21/2026
#3 by bigstorm - opened
No luck with the config in the readme. The listed nightly no longer exists. I tried the latest (v0.14) docker container without luck, and I also tried my most recent nightly build at the time (vllm/vllm-openai:nightly-b4f64e5b02a949c8856c9f81990b77ca56296cdc) — no dice.
Most attempts failed with cublasLt.h compilation errors.
Disabling FlashInfer and using Marlin instead resulted in GEMM errors: `Failed to initialize GEMM: status=7 workspace_size=168448 num_experts=128 M=131072 N=3072 K=3072`
Removing `--enable-expert-parallel` and `--all2all-backend pplx` resolved those issues.
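For reference, these are the two offending entries as they would appear in the compose `command:` list below (drop them if you hit the same GEMM initialization failure on similar hardware):

```yaml
# Both triggered "Failed to initialize GEMM: status=7" on 2x RTX 6000 Pro;
# removing them from the vLLM command line resolved it.
- --enable-expert-parallel
- --all2all-backend
- pplx
```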
My current working docker compose config for 2x RTX 6000 Pros is:
```yaml
services:
  inference:
    image: vllm/vllm-openai:nightly-b4f64e5b02a949c8856c9f81990b77ca56296cdc
    container_name: inference
    ports:
      - "0.0.0.0:8000:8000"
    shm_size: "32g"
    ipc: "host"
    ulimits:
      memlock: -1
      nofile: 1048576
    environment:
      - NCCL_IB_DISABLE=1
      - NCCL_NVLS_ENABLE=0
      - NCCL_P2P_DISABLE=0
      - NCCL_SHM_DISABLE=0
      - VLLM_USE_V1=1
      - VLLM_USE_FLASHINFER_MOE_FP4=0
      - VLLM_MXFP4_USE_MARLIN=1
      - OMP_NUM_THREADS=8
      - SAFETENSORS_FAST_GPU=1
    volumes:
      - /dev/shm:/dev/shm
      - /hdd_nas:/hdd_nas
    command:
      - /hdd_nas/models/lukealonso-MiniMax-M2.1-NVFP4
      - --enable-auto-tool-choice
      - --tool-call-parser
      - minimax_m2
      - --reasoning-parser
      - minimax_m2_append_think
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --served-model-name
      - "MiniMax-M2.1"
      - --tensor-parallel-size
      - "2"
      - --gpu-memory-utilization
      - "0.95"
      - --max-num-batched-tokens
      - "16384"
      - --dtype
      - "auto"
      - --max-num-seqs
      - "16"
      - --kv-cache-dtype
      - fp8
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --trust_remote_code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: always
```
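Once the container is up, you can sanity-check the OpenAI-compatible endpoint with nothing but the standard library. A minimal sketch — the base URL assumes the `8000:8000` port mapping above, and the model name matches `--served-model-name`:

```python
import json
import urllib.request

# Assumes the compose file above: host port 8000, served model "MiniMax-M2.1".
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str) -> bytes:
    # Request body for vLLM's OpenAI-compatible /chat/completions endpoint.
    payload = {
        "model": "MiniMax-M2.1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return json.dumps(payload).encode("utf-8")


def chat(prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI-style response shape: first choice's message content.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Say hello."))
```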
Enjoy responsibly.