robgreenberg3 committed · f3b115b · verified · 1 Parent(s): a13443e

Update README.md

Files changed (1): README.md +136 -0
README.md CHANGED
@@ -126,6 +126,142 @@ print(generated_text)

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/Qwen3-8B-FP8-dynamic
```
</details>
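Once the container is serving, the OpenAI-compatible endpoint can be called from Python. A minimal standard-library sketch, not part of the official instructions; the `localhost:8000` URL assumes the `-p 8000:8000` port mapping shown above, and the prompt text is illustrative:

```python
import json
import urllib.request

# Build an OpenAI-compatible chat request for the server started above.
# The URL assumes the -p 8000:8000 mapping from the podman command.
payload = {
    "model": "RedHatAI/Qwen3-8B-FP8-dynamic",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Send it once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```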

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vLLM server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.24-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.24-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vLLM server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: Qwen3-8B-FP8-dynamic # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: Qwen3-8B-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-qwen3-8b-fp8-dynamic:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-FP8-dynamic",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```
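Because the request above sets `"stream": true`, the response arrives as a stream of `data:` server-sent-event lines rather than one JSON body. A minimal sketch of collecting the streamed deltas in Python; the sample chunks below are illustrative stand-ins for a live response, not captured server output:

```python
import json

def collect_stream(lines):
    """Concatenate content deltas from OpenAI-style SSE chat chunks."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        body = line[len("data: "):]
        if body.strip() == "[DONE]":
            break
        chunk = json.loads(body)
        # The final usage chunk (from stream_options.include_usage) has no choices.
        if chunk.get("choices"):
            text.append(chunk["choices"][0].get("delta", {}).get("content") or "")
    return "".join(text)

# Illustrative chunks in the shape the server streams back:
sample = [
    'data: {"choices": [{"delta": {"content": "Bees"}}]}',
    'data: {"choices": [{"delta": {"content": " beat their wings very fast."}}]}',
    'data: {"choices": [], "usage": {"total_tokens": 42}}',
    "data: [DONE]",
]
print(collect_stream(sample))  # -> Bees beat their wings very fast.
```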

See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>

## Creation

<details>