robgreenberg3 committed
Commit 796df36 · verified · 1 Parent(s): d30b3c2

Update README.md

Files changed (1):
  1. README.md +142 -1
README.md CHANGED
@@ -35,9 +35,15 @@ tasks:
  - text-to-text
  provider: RedHatAI
  license_link: https://www.apache.org/licenses/LICENSE-2.0
+ validated_on:
+ - RHOAI 2.25
+ - RHAIIS 3.2.2
  ---

- # Voxtral-Mini-3B-2507-FP8-dynamic
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+   Voxtral-Mini-3B-2507-FP8-dynamic
+   <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>

  ## Model Overview
  - **Model Architecture:** VoxtralForConditionalGeneration
@@ -193,6 +199,141 @@ print(response)
  ```
  </details>

+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic
+ ```
+ </details>
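+
+ Once the server is up, it exposes an OpenAI-compatible API on port 8000. A minimal client sketch, assuming the `openai` Python package is installed and the served model name defaults to the `--model` value above:
+
+ ```python
+ from openai import OpenAI
+
+ # vLLM's OpenAI-compatible server does not check the API key by default.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic",
+     messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
+     max_tokens=128,
+ )
+ print(response.choices[0].message.content)
+ ```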
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up the vllm server with a ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.25-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.25-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach the model to the vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Voxtral-Mini-3B-2507-FP8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Voxtral-Mini-3B-2507-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/voxtral-mini-3b-2507-fp8-dynamic:1.5-1756955008@sha256:168439c3c83832b48d1aa6652cb207c55cfc6bdf6bbe2cf512992c7e50f357be
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # First, make sure you are in the project where the model will be deployed
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Voxtral-Mini-3B-2507-FP8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
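+
+ The same request can be issued from Python. A minimal sketch with the `openai` client, using the same hypothetical route placeholders as the curl example above:
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the InferenceService route; the API key is unused.
+ client = OpenAI(
+     base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
+     api_key="EMPTY",
+ )
+
+ # Stream tokens as they arrive, mirroring the curl payload above.
+ stream = client.chat.completions.create(
+     model="Voxtral-Mini-3B-2507-FP8-dynamic",
+     messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
+     stream=True,
+ )
+ for chunk in stream:
+     if chunk.choices and chunk.choices[0].delta.content:
+         print(chunk.choices[0].delta.content, end="", flush=True)
+ ```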
+
+ See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+

  ## Creation

  This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.