Running on a Blackwell 96 GB GPU (RTX 6000)

#1
by thommyb - opened
  • I use vLLM 0.10.2
  • I run Podman on Rocky Linux 9

I tried to run this model on a PNY RTX 6000 Blackwell 96 GB card. It consumed the entire memory and took a minute or so to load.
Before that I ran it on an RTX 4500 Ada card with only 24 GB of VRAM; there it consumed 22 GB.
To me it looks like FP8 is not really supported on Blackwell chips. Is that possible?
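For what it's worth, FP8 tensor cores should exist from compute capability 8.9 (Ada) upwards, and the Blackwell workstation cards should report 12.0, so raw hardware support should not be the issue. One way to check is via nvidia-smi (the compute_cap query field should be available on recent drivers):

# print the GPU's compute capability; FP8 needs >= 8.9
nvidia-smi --query-gpu=name,compute_cap --format=csv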

I tried several configs. Here is the YAML file I currently pass as configuration, running with float16 (not what I want).

# apertus8B_startupcfg.yaml
# all vllm parameters are allowed here, BUT replace '-' with '_' in parameter names!

model: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
tokenizer: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
trust_remote_code: true
tensor_parallel_size: 1
max_model_len: 8192
port: 8000
gpu_memory_utilization: 0.5
dtype: float16

This is using almost 50 GB, which matches my gpu_memory_utilization: 0.5 setting, since vLLM pre-allocates that fraction of VRAM for weights plus KV cache.
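If I read the vLLM docs right, dtype only sets the precision of activations and any non-quantized layers, so the FP8 weights should stay FP8 either way; the pre-allocated fraction is the knob that matters. What I would try next is something like this (just a sketch; 0.2 is an arbitrary fraction I picked):

# apertus8B_startupcfg.yaml -- sketch, not yet tested
model: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
tokenizer: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
trust_remote_code: true
tensor_parallel_size: 1
max_model_len: 8192
port: 8000
gpu_memory_utilization: 0.2  # cap vLLM at roughly 20 GB of the 96 GB card
dtype: auto                  # let vLLM keep the checkpoint's FP8 weight dtype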
Here is the output from nvidia-smi:

Sun Oct  5 00:47:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8              3W /  300W |   49994MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            7976      C   VLLM::EngineCore                      49984MiB |
+-----------------------------------------------------------------------------------------+
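To watch just the vLLM process without the full table, the per-process query form also works:

# per-process GPU memory usage, CSV output
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv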

And this is how I start it in Podman:

#!/usr/bin/env bash

PORT=8001
MODEL_ID="RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic"
MODEL_NAME="apertus8b"
IMAGE_TAG="docker.io/vllm/vllm-openai:v0.10.2"
CONTAINER_NAME="vllm-${MODEL_NAME}"
CONFIG_FILE="apertus8B_startupcfg.yaml"


# publish host $PORT to the container's port 8000 ('port: 8000' in the YAML),
# otherwise the API is not reachable from outside the container
podman run \
    --name "$CONTAINER_NAME" \
    --detach \
    --rm \
    --publish "${PORT}:8000" \
    --volume ./hf_model_cache/:"/models/$MODEL_ID":Z \
    --volume ./containerlogs/:/logs:Z \
    --volume ./$CONFIG_FILE:/app/config.yaml:Z \
    --device nvidia.com/gpu=all \
    --entrypoint /bin/bash \
    "$IMAGE_TAG" \
    -c "exec python3 -m vllm.entrypoints.openai.api_server --config /app/config.yaml > /logs/startup.log 2>&1"

I am just developing software on top of this; I do not understand all those different quantization algorithms etc. Sorry for that...
What I would like is to load several models into the 96 GB of VRAM the Blackwell card offers, so it is a bit disappointing that an 8B model consumes the whole memory.
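Roughly what I am aiming for is the sketch below: several vLLM containers sharing the one card, each capped by gpu_memory_utilization in its own config. Untested, and the second model and both config file names are placeholders:

#!/usr/bin/env bash
# Untested sketch: run several vLLM containers on one GPU, each capped by
# gpu_memory_utilization in its own YAML config (e.g. 0.25 each). Every config
# keeps port: 8000 inside the container so the --publish mapping stays uniform.

IMAGE_TAG="docker.io/vllm/vllm-openai:v0.10.2"

start_vllm() {
    local name="$1" config="$2" host_port="$3"
    podman run \
        --name "vllm-${name}" \
        --detach \
        --rm \
        --publish "${host_port}:8000" \
        --volume ./hf_model_cache/:/models:Z \
        --volume "./${config}":/app/config.yaml:Z \
        --device nvidia.com/gpu=all \
        --entrypoint /bin/bash \
        "$IMAGE_TAG" \
        -c "exec python3 -m vllm.entrypoints.openai.api_server --config /app/config.yaml"
}

start_vllm apertus8b apertus8B_startupcfg.yaml 8001
start_vllm second8b  second8B_startupcfg.yaml  8002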
