Instructions to use RedHatAI/gemma-4-31B-it-FP8-block with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use RedHatAI/gemma-4-31B-it-FP8-block with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RedHatAI/gemma-4-31B-it-FP8-block")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("RedHatAI/gemma-4-31B-it-FP8-block")
model = AutoModelForImageTextToText.from_pretrained("RedHatAI/gemma-4-31B-it-FP8-block")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use RedHatAI/gemma-4-31B-it-FP8-block with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RedHatAI/gemma-4-31B-it-FP8-block"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-FP8-block",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
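
The curl request above can also be sent from Python. The following is a minimal sketch using only the standard library, not an official client; the helper names `build_chat_request` and `post_chat` are hypothetical, and it assumes the server started above is listening on `http://localhost:8000` and accepts the same JSON body as the curl call.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, image_url: str) -> dict:
    """Build a chat payload with one text part and one image part (same shape as the curl body)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_chat("http://localhost:8000", build_chat_request("RedHatAI/gemma-4-31B-it-FP8-block", "Describe this image in one sentence.", image_url))` should return the completion as a dictionary, where `image_url` is the image address used in the curl call. Assuming the reply follows the OpenAI chat completions schema, the generated text would be under `["choices"][0]["message"]["content"]`.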

Use Docker

docker model run hf.co/RedHatAI/gemma-4-31B-it-FP8-block

SGLang

How to use RedHatAI/gemma-4-31B-it-FP8-block with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RedHatAI/gemma-4-31B-it-FP8-block" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-FP8-block",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RedHatAI/gemma-4-31B-it-FP8-block" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use RedHatAI/gemma-4-31B-it-FP8-block with Docker Model Runner:
```
docker model run hf.co/RedHatAI/gemma-4-31B-it-FP8-block
```

Problem on Nvidia DGX Spark

by akalongman - opened Apr 15

The pre-compiled FP8 kernels in the current stable vLLM release are fundamentally incompatible with the specific combination of the Grace Blackwell (GB10) chip and the latest NVIDIA 580.142 driver. Errors:

(base) spark@spark-7549:~$ bin/logs.sh 

Tailing vLLM logs... (Press CTRL+C to exit)

Get:1 http://ports.ubuntu.com/ubuntu-ports jammy InRelease [270 kB]

Get:2 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]

Get:3 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main arm64 Packages [38.9 kB]

Get:4 http://ports.ubuntu.com/ubuntu-ports jammy-updates InRelease [128 kB]

Get:5 http://ports.ubuntu.com/ubuntu-ports jammy-backports InRelease [127 kB]

Get:6 http://ports.ubuntu.com/ubuntu-ports jammy-security InRelease [129 kB]

Get:7 http://ports.ubuntu.com/ubuntu-ports jammy/restricted arm64 Packages [24.2 kB]

Get:8 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 Packages [1758 kB]

Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa  InRelease [1579 B]

Get:10 http://ports.ubuntu.com/ubuntu-ports jammy/universe arm64 Packages [17.2 MB]

Get:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa  Packages [1995 kB]

Get:12 http://ports.ubuntu.com/ubuntu-ports jammy/multiverse arm64 Packages [224 kB]

Get:13 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main arm64 Packages [4019 kB]

Get:14 http://ports.ubuntu.com/ubuntu-ports jammy-updates/restricted arm64 Packages [7052 kB]

Get:15 http://ports.ubuntu.com/ubuntu-ports jammy-updates/universe arm64 Packages [1675 kB]

Get:16 http://ports.ubuntu.com/ubuntu-ports jammy-updates/multiverse arm64 Packages [47.7 kB]

Get:17 http://ports.ubuntu.com/ubuntu-ports jammy-backports/universe arm64 Packages [33.9 kB]

Get:18 http://ports.ubuntu.com/ubuntu-ports jammy-backports/main arm64 Packages [83.5 kB]

Get:19 http://ports.ubuntu.com/ubuntu-ports jammy-security/universe arm64 Packages [1373 kB]

Get:20 http://ports.ubuntu.com/ubuntu-ports jammy-security/multiverse arm64 Packages [41.2 kB]

Get:21 http://ports.ubuntu.com/ubuntu-ports jammy-security/main arm64 Packages [3697 kB]

Get:22 http://ports.ubuntu.com/ubuntu-ports jammy-security/restricted arm64 Packages [6844 kB]

Fetched 46.8 MB in 6s (7704 kB/s)

Reading package lists...

Reading package lists...

Building dependency tree...

Reading state information...

The following packages were automatically installed and are no longer required:

  libcusolver-12-9 libcusparse-12-9

Use 'apt autoremove' to remove them.

The following additional packages will be installed:

  git-man less libcbor0.8 liberror-perl libfido2-1 libxmuu1 openssh-client

  xauth

Suggested packages:

  gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui

  gitk gitweb git-cvs git-mediawiki git-svn keychain libpam-ssh monkeysphere

  ssh-askpass

The following NEW packages will be installed:

  git git-man less libcbor0.8 liberror-perl libfido2-1 libxmuu1 openssh-client

  xauth

0 upgraded, 9 newly installed, 0 to remove and 41 not upgraded.

Need to get 5346 kB of archives.

After this operation, 24.3 MB of additional disk space will be used.

Get:1 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main arm64 less arm64 590-1ubuntu0.22.04.3 [141 kB]

Get:2 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 libcbor0.8 arm64 0.8.0-2ubuntu1 [24.3 kB]

Get:3 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 libfido2-1 arm64 1.10.0-1 [81.8 kB]

Get:4 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 libxmuu1 arm64 2:1.1.3-3 [10.4 kB]

Get:5 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main arm64 openssh-client arm64 1:8.9p1-3ubuntu0.14 [860 kB]

Get:6 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 xauth arm64 1:1.1-1build2 [26.8 kB]

Get:7 http://ports.ubuntu.com/ubuntu-ports jammy/main arm64 liberror-perl all 0.17029-1 [26.5 kB]

Get:8 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main arm64 git-man all 1:2.34.1-1ubuntu1.17 [954 kB]

Get:9 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main arm64 git arm64 1:2.34.1-1ubuntu1.17 [3222 kB]

debconf: delaying package configuration, since apt-utils is not installed

Fetched 5346 kB in 2s (3022 kB/s)

Selecting previously unselected package less.

(Reading database ... 22535 files and directories currently installed.)

Preparing to unpack .../0-less_590-1ubuntu0.22.04.3_arm64.deb ...

Unpacking less (590-1ubuntu0.22.04.3) ...

Selecting previously unselected package libcbor0.8:arm64.

Preparing to unpack .../1-libcbor0.8_0.8.0-2ubuntu1_arm64.deb ...

Unpacking libcbor0.8:arm64 (0.8.0-2ubuntu1) ...

Selecting previously unselected package libfido2-1:arm64.

Preparing to unpack .../2-libfido2-1_1.10.0-1_arm64.deb ...

Unpacking libfido2-1:arm64 (1.10.0-1) ...

Selecting previously unselected package libxmuu1:arm64.

Preparing to unpack .../3-libxmuu1_2%3a1.1.3-3_arm64.deb ...

Unpacking libxmuu1:arm64 (2:1.1.3-3) ...

Selecting previously unselected package openssh-client.

Preparing to unpack .../4-openssh-client_1%3a8.9p1-3ubuntu0.14_arm64.deb ...

Unpacking openssh-client (1:8.9p1-3ubuntu0.14) ...

Selecting previously unselected package xauth.

Preparing to unpack .../5-xauth_1%3a1.1-1build2_arm64.deb ...

Unpacking xauth (1:1.1-1build2) ...

Selecting previously unselected package liberror-perl.

Preparing to unpack .../6-liberror-perl_0.17029-1_all.deb ...

Unpacking liberror-perl (0.17029-1) ...

Selecting previously unselected package git-man.

Preparing to unpack .../7-git-man_1%3a2.34.1-1ubuntu1.17_all.deb ...

Unpacking git-man (1:2.34.1-1ubuntu1.17) ...

Selecting previously unselected package git.

Preparing to unpack .../8-git_1%3a2.34.1-1ubuntu1.17_arm64.deb ...

Unpacking git (1:2.34.1-1ubuntu1.17) ...

Setting up libcbor0.8:arm64 (0.8.0-2ubuntu1) ...

Setting up less (590-1ubuntu0.22.04.3) ...

Setting up liberror-perl (0.17029-1) ...

Setting up git-man (1:2.34.1-1ubuntu1.17) ...

Setting up libfido2-1:arm64 (1.10.0-1) ...

Setting up libxmuu1:arm64 (2:1.1.3-3) ...

Setting up openssh-client (1:8.9p1-3ubuntu0.14) ...

update-alternatives: using /usr/bin/ssh to provide /usr/bin/rsh (rsh) in auto mode

update-alternatives: warning: skip creation of /usr/share/man/man1/rsh.1.gz because associated file /usr/share/man/man1/ssh.1.gz (of link group rsh) doesn't exist

update-alternatives: using /usr/bin/slogin to provide /usr/bin/rlogin (rlogin) in auto mode

update-alternatives: warning: skip creation of /usr/share/man/man1/rlogin.1.gz because associated file /usr/share/man/man1/slogin.1.gz (of link group rlogin) doesn't exist

update-alternatives: using /usr/bin/scp to provide /usr/bin/rcp (rcp) in auto mode

update-alternatives: warning: skip creation of /usr/share/man/man1/rcp.1.gz because associated file /usr/share/man/man1/scp.1.gz (of link group rcp) doesn't exist

Setting up git (1:2.34.1-1ubuntu1.17) ...

Setting up xauth (1:1.1-1build2) ...

Processing triggers for libc-bin (2.35-0ubuntu3.10) ...

Processing triggers for mailcap (3.70+nmu1ubuntu1) ...

Collecting git+https://github.com/huggingface/transformers.git

  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-ykjba1_y

  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-ykjba1_y

  Resolved https://github.com/huggingface/transformers.git to commit 18aa0866577ce3270846d5d44a61534f636f9b42

  Installing build dependencies: started

  Installing build dependencies: finished with status 'done'

  Getting requirements to build wheel: started

  Getting requirements to build wheel: finished with status 'done'

  Preparing metadata (pyproject.toml): started

  Preparing metadata (pyproject.toml): finished with status 'done'

Collecting huggingface-hub<2.0,>=1.5.0 (from transformers==5.6.0.dev0)

  Downloading huggingface_hub-1.10.2-py3-none-any.whl.metadata (14 kB)

Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (2.2.6)

Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (26.0)

Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (6.0.3)

Requirement already satisfied: regex>=2025.10.22 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (2026.3.32)

Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (0.22.2)

Requirement already satisfied: typer in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (0.24.1)

Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (0.7.0)

Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers==5.6.0.dev0) (4.67.3)

Requirement already satisfied: filelock>=3.10.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (3.25.2)

Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (2026.3.0)

Requirement already satisfied: hf-xet<2.0.0,>=1.4.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (1.4.3)

Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (0.28.1)

Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (4.15.0)

Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (4.13.0)

Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (2026.2.25)

Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (1.0.9)

Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (3.11)

Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.5.0->transformers==5.6.0.dev0) (0.16.0)

Requirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.12/dist-packages (from typer->transformers==5.6.0.dev0) (8.3.1)

Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers==5.6.0.dev0) (1.5.4)

Requirement already satisfied: rich>=12.3.0 in /usr/local/lib/python3.12/dist-packages (from typer->transformers==5.6.0.dev0) (14.3.3)

Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from typer->transformers==5.6.0.dev0) (0.0.4)

Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers==5.6.0.dev0) (4.0.0)

Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=12.3.0->typer->transformers==5.6.0.dev0) (2.20.0)

Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=12.3.0->typer->transformers==5.6.0.dev0) (0.1.2)

Downloading huggingface_hub-1.10.2-py3-none-any.whl (642 kB)

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 642.6/642.6 kB 4.6 MB/s  0:00:00

Building wheels for collected packages: transformers

  Building wheel for transformers (pyproject.toml): started

  Building wheel for transformers (pyproject.toml): finished with status 'done'

  Created wheel for transformers: filename=transformers-5.6.0.dev0-py3-none-any.whl size=11425325 sha256=4deb440a2202168e59dd4605dcb9aa7b299f23a34556100769fb7476f6cc41c0

  Stored in directory: /tmp/pip-ephem-wheel-cache-339m4kby/wheels/54/cb/3f/83103de5575c534436d6a4686686dead458238dfaf1147e98d

Successfully built transformers

Installing collected packages: huggingface-hub, transformers

  Attempting uninstall: huggingface-hub

    Found existing installation: huggingface_hub 0.36.2

    Uninstalling huggingface_hub-0.36.2:

      Successfully uninstalled huggingface_hub-0.36.2

  Attempting uninstall: transformers

    Found existing installation: transformers 4.57.6

    Uninstalling transformers-4.57.6:

      Successfully uninstalled transformers-4.57.6



ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

vllm 0.19.0 requires transformers<5,>=4.56.0, but you have transformers 5.6.0.dev0 which is incompatible.

compressed-tensors 0.14.0.1 requires transformers<5.0.0, but you have transformers 5.6.0.dev0 which is incompatible.

Successfully installed huggingface-hub-1.10.2 transformers-5.6.0.dev0

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:299] 

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:299]        █     █     █▄   ▄█

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:299]   █▄█▀ █     █     █     █  model   RedHatAI/gemma-4-31B-it-FP8-block

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:299] 

(APIServer pid=1) INFO 04-15 15:37:03 [utils.py:233] non-default args: {'model_tag': 'RedHatAI/gemma-4-31B-it-FP8-block', 'tool_call_parser': 'gemma4', 'model': 'RedHatAI/gemma-4-31B-it-FP8-block', 'max_model_len': 32768, 'quantization': 'compressed-tensors', 'enforce_eager': True, 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.85}

(APIServer pid=1) INFO 04-15 15:37:08 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration

(APIServer pid=1) INFO 04-15 15:37:08 [model.py:1678] Using max model len 32768

(APIServer pid=1) INFO 04-15 15:37:08 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.

(APIServer pid=1) INFO 04-15 15:37:08 [vllm.py:790] Asynchronous scheduling is enabled.

(APIServer pid=1) WARNING 04-15 15:37:08 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none

(APIServer pid=1) WARNING 04-15 15:37:08 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.

(APIServer pid=1) INFO 04-15 15:37:08 [vllm.py:1025] Cudagraph is disabled under eager mode

(APIServer pid=1) INFO 04-15 15:37:08 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant

(EngineCore pid=653) INFO 04-15 15:37:15 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='RedHatAI/gemma-4-31B-it-FP8-block', speculative_config=None, tokenizer='RedHatAI/gemma-4-31B-it-FP8-block', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=RedHatAI/gemma-4-31B-it-FP8-block, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 
'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}

(EngineCore pid=653) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:435: UserWarning: 

(EngineCore pid=653)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.

(EngineCore pid=653)     Minimum and Maximum cuda capability supported by this version of PyTorch is

(EngineCore pid=653)     (8.0) - (12.0)

(EngineCore pid=653)     

(EngineCore pid=653)   queued_call()

(EngineCore pid=653) INFO 04-15 15:37:18 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:45123 backend=nccl

(EngineCore pid=653) INFO 04-15 15:37:18 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A

(EngineCore pid=653) INFO 04-15 15:37:19 [gpu_model_runner.py:4735] Starting to load model RedHatAI/gemma-4-31B-it-FP8-block...

(EngineCore pid=653) INFO 04-15 15:37:19 [vllm.py:790] Asynchronous scheduling is enabled.

(EngineCore pid=653) WARNING 04-15 15:37:19 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none

(EngineCore pid=653) WARNING 04-15 15:37:19 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.

(EngineCore pid=653) INFO 04-15 15:37:19 [vllm.py:1025] Cudagraph is disabled under eager mode

(EngineCore pid=653) INFO 04-15 15:37:19 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant

(EngineCore pid=653) INFO 04-15 15:37:19 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.

(EngineCore pid=653) INFO 04-15 15:37:19 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  50% Completed | 1/2 [02:49<02:49, 169.91s/it]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [03:30<00:00, 94.00s/it]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [03:30<00:00, 105.39s/it]

(EngineCore pid=653) 

(EngineCore pid=653) INFO 04-15 15:40:52 [default_loader.py:384] Loading weights took 210.97 seconds

(EngineCore pid=653) INFO 04-15 15:40:53 [gpu_model_runner.py:4820] Model loading took 31.48 GiB memory and 213.581265 seconds

(EngineCore pid=653) INFO 04-15 15:40:53 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 2496 tokens, and profiled with 1 video items of the maximum feature size.

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108] EngineCore failed to start.

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108] Traceback (most recent call last):

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return func(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     super().__init__(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return func(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     available_gpu_memory = self.model_executor.determine_available_memory()

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self.collective_rpc("determine_available_memory")

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     result = run_method(self.driver_worker, method, args, kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return func(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return func(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     self.model_runner.profile_run()

(EngineCore pid=653) Process EngineCore:

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     hidden_states, last_hidden_states = self._dummy_run(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                                         ^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return func(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     outputs = self.model(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]               ^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self._call_impl(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return forward_call(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mm.py", line 1276, in forward

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     hidden_states = self.language_model.model(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 452, in __call__

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self.forward(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 896, in forward

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     hidden_states, residual = layer(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                               ^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self._call_impl(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return forward_call(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 591, in forward

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     hidden_states = self.self_attn(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                     ^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self._call_impl(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return forward_call(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 394, in forward

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     qkv, _ = self.qkv_proj(hidden_states)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self._call_impl(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return forward_call(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 582, in forward

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     output_parallel = self.quant_method.apply(self, input_, bias)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 921, in apply

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return scheme.apply_weights(layer, x, bias=bias)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 201, in apply_weights

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self.w8a8_block_fp8_linear.apply(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 421, in apply

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     output = self.w8a8_blockscale_op(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 467, in _run_cutlass

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return cutlass_scaled_mm(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 69, in cutlass_scaled_mm

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return ops.cutlass_scaled_mm(

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 845, in cutlass_scaled_mm

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]     return self._op(*args, **kwargs)

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) ERROR 04-15 15:41:17 [core.py:1108] RuntimeError: Error Internal

(EngineCore pid=653) Traceback (most recent call last):

(EngineCore pid=653)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

(EngineCore pid=653)     self.run()

(EngineCore pid=653)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run

(EngineCore pid=653)     self._target(*self._args, **self._kwargs)

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core

(EngineCore pid=653)     raise e

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core

(EngineCore pid=653)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)

(EngineCore pid=653)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

(EngineCore pid=653)     return func(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__

(EngineCore pid=653)     super().__init__(

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__

(EngineCore pid=653)     kv_cache_config = self._initialize_kv_caches(vllm_config)

(EngineCore pid=653)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

(EngineCore pid=653)     return func(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches

(EngineCore pid=653)     available_gpu_memory = self.model_executor.determine_available_memory()

(EngineCore pid=653)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory

(EngineCore pid=653)     return self.collective_rpc("determine_available_memory")

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc

(EngineCore pid=653)     result = run_method(self.driver_worker, method, args, kwargs)

(EngineCore pid=653)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method

(EngineCore pid=653)     return func(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context

(EngineCore pid=653)     return func(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory

(EngineCore pid=653)     self.model_runner.profile_run()

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run

(EngineCore pid=653)     hidden_states, last_hidden_states = self._dummy_run(

(EngineCore pid=653)                                         ^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context

(EngineCore pid=653)     return func(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run

(EngineCore pid=653)     outputs = self.model(

(EngineCore pid=653)               ^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653)     return self._call_impl(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653)     return forward_call(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mm.py", line 1276, in forward

(EngineCore pid=653)     hidden_states = self.language_model.model(

(EngineCore pid=653)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 452, in __call__

(EngineCore pid=653)     return self.forward(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 896, in forward

(EngineCore pid=653)     hidden_states, residual = layer(

(EngineCore pid=653)                               ^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653)     return self._call_impl(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653)     return forward_call(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 591, in forward

(EngineCore pid=653)     hidden_states = self.self_attn(

(EngineCore pid=653)                     ^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653)     return self._call_impl(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653)     return forward_call(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 394, in forward

(EngineCore pid=653)     qkv, _ = self.qkv_proj(hidden_states)

(EngineCore pid=653)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl

(EngineCore pid=653)     return self._call_impl(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl

(EngineCore pid=653)     return forward_call(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 582, in forward

(EngineCore pid=653)     output_parallel = self.quant_method.apply(self, input_, bias)

(EngineCore pid=653)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 921, in apply

(EngineCore pid=653)     return scheme.apply_weights(layer, x, bias=bias)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py", line 201, in apply_weights

(EngineCore pid=653)     return self.w8a8_block_fp8_linear.apply(

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 421, in apply

(EngineCore pid=653)     output = self.w8a8_blockscale_op(

(EngineCore pid=653)              ^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 467, in _run_cutlass

(EngineCore pid=653)     return cutlass_scaled_mm(

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 69, in cutlass_scaled_mm

(EngineCore pid=653)     return ops.cutlass_scaled_mm(

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 845, in cutlass_scaled_mm

(EngineCore pid=653)     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)

(EngineCore pid=653)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__

(EngineCore pid=653)     return self._op(*args, **kwargs)

(EngineCore pid=653)            ^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=653) RuntimeError: Error Internal

[rank0]:[W415 15:41:17.823239800 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

(APIServer pid=1) Traceback (most recent call last):

(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>

(APIServer pid=1)     sys.exit(main())

(APIServer pid=1)              ^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main

(APIServer pid=1)     args.dispatch_function(args)

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd

(APIServer pid=1)     uvloop.run(run_server(args))

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run

(APIServer pid=1)     return __asyncio.run(

(APIServer pid=1)            ^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run

(APIServer pid=1)     return runner.run(main)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run

(APIServer pid=1)     return self._loop.run_until_complete(task)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper

(APIServer pid=1)     return await main

(APIServer pid=1)            ^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server

(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker

(APIServer pid=1)     async with build_async_engine_client(

(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

(APIServer pid=1)     return await anext(self.gen)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client

(APIServer pid=1)     async with build_async_engine_client_from_engine_args(

(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

(APIServer pid=1)     return await anext(self.gen)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args

(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(

(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config

(APIServer pid=1)     return cls(

(APIServer pid=1)            ^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__

(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(

(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

(APIServer pid=1)     return func(*args, **kwargs)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client

(APIServer pid=1)     return AsyncMPClient(*client_args)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

(APIServer pid=1)     return func(*args, **kwargs)

(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__

(APIServer pid=1)     super().__init__(

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__

(APIServer pid=1)     with launch_core_engines(

(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__

(APIServer pid=1)     next(self.gen)

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines

(APIServer pid=1)     wait_for_engine_startup(

(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup

(APIServer pid=1)     raise RuntimeError(

(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

bdellabe

Red Hat AI org Apr 15

Hi @akalongman, please open this as an issue in vLLM.
