Running on a Blackwell 96 GB GPU (RTX 6000)

#1
by thommyb - opened
  • I use vLLM 0.10.2
  • I run Podman on Rocky Linux 9

I tried to run this model on a PNY RTX 6000 Blackwell 96 GB card. It consumed the entire memory and took a minute or so to load.
Before that I ran it on an RTX 4500 Ada card with only 24 GB of VRAM; there it consumed 22 GB.
To me it looks like FP8 is not really supported on Blackwell chips. Is that possible?
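For what it's worth, FP8 tensor cores should exist from compute capability 8.9 (Ada) upwards, and the Blackwell workstation cards should report 12.0, so raw hardware support should not be the issue. One way to check is via nvidia-smi (the compute_cap query field should be available on recent drivers):

# print the GPU's compute capability; FP8 needs >= 8.9
nvidia-smi --query-gpu=name,compute_cap --format=csv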

I tried several configs. Here is the YAML file I currently pass as configuration, running with float16 (not what I want).

# apertus8B_startupcfg.yaml
# all vllm parameters are allowed here, BUT replace '-' with '_' in parameter names!

model: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
tokenizer: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
trust_remote_code: true
tensor_parallel_size: 1
max_model_len: 8192
port: 8000
gpu_memory_utilization: 0.5
dtype: float16

This is using almost 50 GB, which matches my gpu_memory_utilization: 0.5 setting, since vLLM pre-allocates that fraction of VRAM for weights plus KV cache.
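If I read the vLLM docs right, dtype only sets the precision of activations and any non-quantized layers, so the FP8 weights should stay FP8 either way; the pre-allocated fraction is the knob that matters. What I would try next is something like this (just a sketch; 0.2 is an arbitrary fraction I picked):

# apertus8B_startupcfg.yaml -- sketch, not yet tested
model: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
tokenizer: /models/RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
trust_remote_code: true
tensor_parallel_size: 1
max_model_len: 8192
port: 8000
gpu_memory_utilization: 0.2  # cap vLLM at roughly 20 GB of the 96 GB card
dtype: auto                  # let vLLM keep the checkpoint's FP8 weight dtype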
Here is the output from nvidia-smi:

Sun Oct  5 00:47:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8              3W /  300W |   49994MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            7976      C   VLLM::EngineCore                      49984MiB |
+-----------------------------------------------------------------------------------------+
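To watch just the vLLM process without the full table, the per-process query form also works:

# per-process GPU memory usage, CSV output
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv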

And this is how I start it in Podman:

#!/usr/bin/env bash

PORT=8001
MODEL_ID="RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic"
MODEL_NAME="apertus8b"
IMAGE_TAG="docker.io/vllm/vllm-openai:v0.10.2"
CONTAINER_NAME="vllm-${MODEL_NAME}"
CONFIG_FILE="apertus8B_startupcfg.yaml"


# publish host $PORT to the container's port 8000 ('port: 8000' in the YAML),
# otherwise the API is not reachable from outside the container
podman run \
    --name "$CONTAINER_NAME" \
    --detach \
    --rm \
    --publish "${PORT}:8000" \
    --volume ./hf_model_cache/:"/models/$MODEL_ID":Z \
    --volume ./containerlogs/:/logs:Z \
    --volume ./$CONFIG_FILE:/app/config.yaml:Z \
    --device nvidia.com/gpu=all \
    --entrypoint /bin/bash \
    "$IMAGE_TAG" \
    -c "exec python3 -m vllm.entrypoints.openai.api_server --config /app/config.yaml > /logs/startup.log 2>&1"

I am just developing software on top of this; I do not understand all those different quantization algorithms etc. Sorry for that...
What I would like is to load several models into the 96 GB of VRAM the Blackwell card offers, so it is a bit disappointing that an 8B model consumes the whole memory.
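Roughly what I am aiming for is the sketch below: several vLLM containers sharing the one card, each capped by gpu_memory_utilization in its own config. Untested, and the second model and both config file names are placeholders:

#!/usr/bin/env bash
# Untested sketch: run several vLLM containers on one GPU, each capped by
# gpu_memory_utilization in its own YAML config (e.g. 0.25 each). Every config
# keeps port: 8000 inside the container so the --publish mapping stays uniform.

IMAGE_TAG="docker.io/vllm/vllm-openai:v0.10.2"

start_vllm() {
    local name="$1" config="$2" host_port="$3"
    podman run \
        --name "vllm-${name}" \
        --detach \
        --rm \
        --publish "${host_port}:8000" \
        --volume ./hf_model_cache/:/models:Z \
        --volume "./${config}":/app/config.yaml:Z \
        --device nvidia.com/gpu=all \
        --entrypoint /bin/bash \
        "$IMAGE_TAG" \
        -c "exec python3 -m vllm.entrypoints.openai.api_server --config /app/config.yaml"
}

start_vllm apertus8b apertus8B_startupcfg.yaml 8001
start_vllm second8b  second8B_startupcfg.yaml  8002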
