OutOfMemory during weights loading (vLLM)
#1
by nephepritou - opened
Here is my llama-swap config:
```yaml
minimax-m2.7-4bit:
  env:
    - VLLM_LOG_STATS_INTERVAL=5
    - VLLM_MARLIN_USE_ATOMIC_ADD=1
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    - OMP_NUM_THREADS=4
    - VIRTUAL_ENV=/home/gleb/llm/env_qwen_35
  cmd: |
    /home/gleb/.local/bin/uv run
    -m vllm.entrypoints.openai.api_server
    --model /mnt/data/llm-data/models/Lasimeri/MiniMax-M2.7-int4-AutoRound
    #--quantization gptq_marlin
    --served-model-name "minimax-m2.7-4bit"
    --port ${PORT}
    -tp 8
    #--performance-mode interactivity
    --enable-sleep-mode
    --max-num-batched-tokens 8192
    --enable-prefix-caching
    --enable-chunked-prefill
    #--kv-offloading-size 32
    #--disable-hybrid-kv-cache-manager
    --max-model-len auto
    --gpu-memory-utilization 0.9
    --max-num-seqs 2
    --attention-backend flashinfer
    #--kv-cache-dtype float16
    --dtype half
    --reasoning-parser minimax_m2
    --enable-auto-tool-choice
    --tool-call-parser minimax_m2
    #--load-format instanttensor
    --trust-remote-code
    #--enable-expert-parallel
    #--speculative-config '{"method":"suffix","num_speculative_tokens":16}'
```
With expert parallelism enabled it loads and runs fine, but slowly. Without expert parallelism it crashes with an OOM during weight loading. Running on 8x RTX 3090.
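For context, here is a rough back-of-the-envelope check of the per-GPU weight footprint under plain tensor parallelism. The parameter count and per-weight overhead below are assumptions for illustration, not measured values:

```python
# Rough per-GPU weight-memory estimate for tensor-parallel loading.
# total_params and bits_per_weight are assumptions, not measured values.
total_params = 230e9      # assumed MoE total parameter count
bits_per_weight = 4.5     # int4 weights plus scales/zero-points (assumed overhead)
tp_degree = 8
gpu_mem_gib = 24          # RTX 3090

weights_gib = total_params * bits_per_weight / 8 / 2**30
per_gpu_gib = weights_gib / tp_degree
headroom_gib = gpu_mem_gib - per_gpu_gib

print(f"total quantized weights: {weights_gib:.1f} GiB")
print(f"per GPU at tp={tp_degree}: {per_gpu_gib:.1f} GiB")
print(f"headroom per GPU: {headroom_gib:.1f} GiB")
# The headroom still has to cover the CUDA context, NCCL buffers, activation
# workspace and the KV cache, and quantized-kernel repacking can allocate
# transient buffers on top of the steady-state weight size, which is one
# plausible way loading OOMs even when the weights nominally fit.
```

With expert parallelism the experts are distributed across GPUs rather than each being split, so the peak transient footprint during loading can differ, which might be why EP loads fine here while plain TP does not.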