vLLM Config
Hi,
thanks for the work!
Besides the config options already set in your example script, did you set anything else?
I am also trying to run this on dual RTX 6000 Blackwell, but with only what you provided it complains about a lack of VRAM and won't start up.
I am using the latest vLLM docker image.
I added some other config options to make it work, but my experience with vLLM is very limited and I am still in the process of understanding the options, so I don't know if these make sense together, but at least it works for now.
I added export VLLM_ATTENTION_BACKEND=FLASHINFER to your script and then added these via a config file:
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}'
async-scheduling: true
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
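For context, this is roughly how I wire that up: a small YAML file that vllm serve reads via --config (the file name here is just a placeholder, and the flags already in your script stay as they are):

export VLLM_ATTENTION_BACKEND=FLASHINFER
# the extra options above go into a YAML file...
cat > vllm-extra.yaml <<'EOF'
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}'
async-scheduling: true
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
EOF
# ...and are passed to the server alongside the existing flags
vllm serve "${MODEL}" --config vllm-extra.yaml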
First question: are you dedicating the GPUs to vLLM, or do you run Xorg or Wayland on them?
In my case I run my display on the integrated GPU, and the Nvidia driver (or Wayland?) still reserves about 600MB on each card. If I run a display manager on my Nvidia GPUs, about 1.2GB is used instead.
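If you want to check how much is already reserved on each card before vLLM starts, a quick query does it:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv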
Then this is my setup:
VLLM_ENV=$(cat <<EOF
--env PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 \
--env VLLM_SLEEP_WHEN_IDLE=1
EOF
)
podman run --replace --detach \
--name "${PODNAME}" \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--network=host \
--ipc=host \
-v "${LOCAL_MODELS}":/workspace/local_models \
-v "${HF_CACHE}":/root/.cache/huggingface \
${VLLM_ENV} \
"${VLLM_CT}" \
vllm serve "${MODEL}" \
--trust-remote-code \
--port "${VLLM_PORT}" \
--tensor-parallel-size 2 \
--max-num-seqs 64 \
--max-cudagraph-capture-size 64 \
--gpu-memory-utilization ${GPU_UTIL} \
--served-model-name "${MODELNAME}" \
--attention-backend flashinfer \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--override-generation-config "${SAMPLER_OVERRIDE}" \
"$@"
Some comments on your options:
- I would avoid the fp8 KV-cache at the moment. With the quantization the model already gets stuck in a loop unless I add the non-standard repetition_penalty and frequency_penalty (roughly what I pass is sketched after this list), and I fear a quantized KV-cache would need even stronger sampling parameters.
- compilation config: I've experimented with those in the past and didn't get anything out of it except one hardware incompatibility with enable_fi_allreduce_fusion. I can't comment on fuse_allreduce_rms and eliminate_noops, but here are my notes:
  # TensorRT allreduce fusion requires a cvt instruction that seems to be Tesla-class only
  # Also compiled fusion seems to be slower and one option seems incompatible with structured output, maybe async_tp
  # Also it requires 10% more GPU memory
  # COMPIL_FLAGS='{"pass_config":{"enable_sequence_parallelism":true,"enable_async_tp":true,"enable_fusion":true,"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'
- async-scheduling: during July/August it was incompatible with structured outputs (forcing JSON output), so I'm not using it. Apparently it was solved in December though: https://github.com/vllm-project/vllm/pull/29821
- no-enable-prefix-caching: drop that one, or you will kill your performance on multi-turn conversations. Without prefix caching you have to reprocess the whole conversation from the start; on dual RTX Pro 6000 that's about 5000 tok/s, so if you are deep into a coding session with 150000 tokens it means 30s of processing.
- max-cudagraph-capture-size: unless you routinely submit 2048 short prompts, that's overkill. Even if you limit it to, say, 10 (I use 64), the requests that don't get scheduled are just queued. In terms of throughput, you can saturate 2x RTX Pro 6000 with about 10~11 prompts that require long generation.
- No specific comment on max-num-batched-tokens. I used to use 10240 or 16384 but I reverted to the default 2048 because I wanted latency over global throughput (https://docs.vllm.ai/en/stable/configuration/optimization/#chunked-prefill), and TTFT with a high max-num-batched-tokens wasn't really an issue when everything is in the prefix cache anyway.
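Related to the fp8 KV-cache point above: the penalties I mentioned go in through the ${SAMPLER_OVERRIDE} variable used in the script. A sketch of what that can look like (the values are placeholders, not a recommendation, and depending on your vLLM version frequency_penalty may need to be sent per request instead of via the override):

SAMPLER_OVERRIDE='{"temperature": 1.0, "top_p": 0.95, "repetition_penalty": 1.05}'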
I suggest you also set --max-num-seqs equal to --max-cudagraph-capture-size, otherwise you would schedule a lot of requests in parallel but a large portion of them won't have cudagraph acceleration and will suffer from CUDA kernel launch overhead.
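Put together, your config file after these changes would look roughly like this (whether you keep async-scheduling given the PR above is up to you; the values are just the ones I use):

max-cudagraph-capture-size: 64
max-num-seqs: 64
# kv-cache-dtype: fp8 and no-enable-prefix-caching removed, so prefix caching stays enabled
# max-num-batched-tokens left at its default for latency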
Thank you for your time!
I double checked my config and took your advice and read more about the config options.
I have it running now.
Just to answer your question about the system: it is a headless server that I use only for LLM inference. Actually I use Proxmox as the host and an Arch Linux VM as a guest with PCIe passthrough. This way I can do easy snapshots before I try new things and roll back with one click if I mess things up ;)
kind regards