robgreenberg3 committed
Commit 796df36 · verified · 1 Parent(s): d30b3c2

Update README.md

Files changed (1):
  1. README.md +142 -1
README.md CHANGED
@@ -35,9 +35,15 @@ tasks:
  - text-to-text
  provider: RedHatAI
  license_link: https://www.apache.org/licenses/LICENSE-2.0
+ validated_on:
+ - RHOAI 2.25
+ - RHAIIS 3.2.2
  ---

- # Voxtral-Mini-3B-2507-FP8-dynamic
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+   Voxtral-Mini-3B-2507-FP8-dynamic
+   <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>

  ## Model Overview
  - **Model Architecture:** VoxtralForConditionalGeneration
@@ -193,6 +199,141 @@ print(response)
  ```
  </details>

+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic
+ ```
+ </details>
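+
+ Once the server is up, it exposes an OpenAI-compatible API on port 8000. A minimal client sketch, assuming the `openai` Python package is installed and the served model name defaults to the `--model` value above:
+
+ ```python
+ from openai import OpenAI
+
+ # vLLM's OpenAI-compatible server does not check the API key by default.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic",
+     messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
+     max_tokens=128,
+ )
+ print(response.choices[0].message.content)
+ ```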
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up the vllm server with a ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.25-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.25-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach the model to the vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Voxtral-Mini-3B-2507-FP8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Voxtral-Mini-3B-2507-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/voxtral-mini-3b-2507-fp8-dynamic:1.5-1756955008@sha256:168439c3c83832b48d1aa6652cb207c55cfc6bdf6bbe2cf512992c7e50f357be
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # First, make sure you are in the project where the model will be deployed
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Voxtral-Mini-3B-2507-FP8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
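+
+ The same request can be issued from Python. A minimal sketch with the `openai` client, using the same hypothetical route placeholders as the curl example above:
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the InferenceService route; the API key is unused.
+ client = OpenAI(
+     base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
+     api_key="EMPTY",
+ )
+
+ # Stream tokens as they arrive, mirroring the curl payload above.
+ stream = client.chat.completions.create(
+     model="Voxtral-Mini-3B-2507-FP8-dynamic",
+     messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
+     stream=True,
+ )
+ for chunk in stream:
+     if chunk.choices and chunk.choices[0].delta.content:
+         print(chunk.choices[0].delta.content, end="", flush=True)
+ ```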
+
+ See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+

  ## Creation

  This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.