Running on a Blackwell 96 GB GPU (RTX 6000)
- I use vLLM 0.10.2
- I run podman on Rocky Linux 9
I tried to run this model on a PNY RTX 6000 Blackwell 96 GB card. It consumed the entire memory and took about a minute to load.
Previously I ran it on an RTX 4500 Ada card with only 24 GB of VRAM, where it consumed 22 GB.
To me it looks like FP8 is not really supported on Blackwell chips. Is that possible?
I tried several configs. Here is the YAML file I currently pass as configuration, running with float16 (not what I want):
# apertus8B_startupcfg.yaml
# all vllm parameters are allowed here, BUT replace '-' with '_' in parameter names!
model: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
tokenizer: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
trust_remote_code: true
tensor_parallel_size: 1
max_model_len: 8192
port: 8000
gpu_memory_utilization: 0.5
dtype: float16
This is using almost 50 GB due to my gpu_memory_utilization: 0.5 setting.
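As far as I understand it, vLLM pre-allocates roughly gpu_memory_utilization times the total VRAM for weights plus KV cache, independent of how small the quantized weights are, so the ~50 GB may follow from that setting alone. A quick sanity check against the nvidia-smi numbers below:

```shell
# vLLM reserves about gpu_memory_utilization * total VRAM up front,
# regardless of the quantized model size.
total_mib=97887                      # total VRAM reported by nvidia-smi
util_tenths=5                        # gpu_memory_utilization: 0.5, expressed in tenths
echo $(( total_mib * util_tenths / 10 ))   # ~48943 MiB, close to the observed 49994 MiB
```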
Here is the output from nvidia-smi:
Sun Oct 5 00:47:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:01:00.0 Off | Off |
| 30% 35C P8 3W / 300W | 49994MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 7976 C VLLM::EngineCore 49984MiB |
+-----------------------------------------------------------------------------------------+
And this is how I start it in Podman:
#!/usr/bin/env bash
PORT=8001
MODEL_ID="RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic"
MODEL_NAME="apertus8b"
IMAGE_TAG="docker.io/vllm/vllm-openai:v0.10.2"
CONTAINER_NAME="vllm-${MODEL_NAME}"
CONFIG_FILE="apertus8B_startupcfg.yaml"
# note: the YAML sets port 8000 inside the container; publish it on host port $PORT
podman run \
  --name "$CONTAINER_NAME" \
  --detach \
  --rm \
  --publish "$PORT:8000" \
  --volume ./hf_model_cache/:"/models/$MODEL_ID":Z \
  --volume ./containerlogs/:/logs:Z \
  --volume "./$CONFIG_FILE":/app/config.yaml:Z \
  --device nvidia.com/gpu=all \
  --entrypoint /bin/bash \
  "$IMAGE_TAG" \
  -c "exec python3 -m vllm.entrypoints.openai.api_server --config /app/config.yaml > /logs/startup.log 2>&1"
I am just developing software on top of this; I do not understand all those different quantization algorithms etc. Sorry for that...
What I'd like is to load several models into the 96 GB of VRAM the Blackwell card offers, and it is a bit disappointing that an 8B model consumes the whole memory.
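For reference, here is the config variant I plan to try next. This is only a sketch: I'm assuming that leaving dtype at auto lets vLLM keep the checkpoint's FP8 quantization, and that a lower gpu_memory_utilization leaves VRAM free for other models on the same card.

```yaml
# apertus8B_startupcfg.yaml - variant I plan to try (untested sketch)
model: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
tokenizer: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
trust_remote_code: true
tensor_parallel_size: 1
max_model_len: 8192
port: 8000
# reserve only ~15% of the 96 GB (~14 GB) so other models fit alongside
gpu_memory_utilization: 0.15
# leave dtype at auto so the checkpoint's quantization is (hopefully) kept
dtype: auto
```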