robgreenberg3 committed on
Commit 287ddf5 · verified · 1 Parent(s): aca9664

Update README.md

Files changed (1):
  1. README.md +149 -5

README.md CHANGED
@@ -21,13 +21,21 @@ tasks:
  - text-generation
  provider: OpenAI
  license_link: https://www.apache.org/licenses/LICENSE-2.0
  ---

- <p align="center">
- <img alt="gpt-oss-20b" src="https://raw.githubusercontent.com/openai/gpt-oss/main/docs/gpt-oss-20b.svg">
- </p>

- <p align="center">
  <a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
  <a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
  <a href="https://arxiv.org/abs/2508.10925"><strong>Model card</strong></a> ·
@@ -46,7 +54,7 @@ Both models were trained on our [harmony response format](https://github.com/ope
  > [!NOTE]
- > This model card is dedicated to the smaller `gpt-oss-20b` model. Check out [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) for the larger model.

  # Highlights
@@ -121,6 +129,142 @@ vllm serve openai/gpt-oss-20b

  [Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)

  ## PyTorch / Triton

  To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).
 
  - text-generation
  provider: OpenAI
  license_link: https://www.apache.org/licenses/LICENSE-2.0
+ validated_on:
+ - RHOAI 2.25
+ - RHAIIS 3.2.2
  ---

+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ gpt-oss-20b
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

+ <p>
  <a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
  <a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
  <a href="https://arxiv.org/abs/2508.10925"><strong>Model card</strong></a> ·
 
  > [!NOTE]
+ > This model card is dedicated to the smaller `gpt-oss-20b` model. Check out [`gpt-oss-120b`](https://huggingface.co/RedHatAI/gpt-oss-120b) for the larger model.

  # Highlights
 
 
  [Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)

+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" \
+   -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/gpt-oss-20b
+ ```
+ </details>
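Once the container is up, port 8000 serves vLLM's OpenAI-compatible API. A minimal smoke-test sketch, standard library only; `BASE_URL`, `build_payload`, and `ask` are hypothetical names, and the model ID follows the `vllm serve` command above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port published by the podman command above


def build_payload(prompt: str) -> dict:
    """Build a chat-completions request body for the served model."""
    return {
        "model": "RedHatAI/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }


def ask(prompt: str) -> str:
    """POST the prompt to the OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (with the server running):
#   print(ask("Why is the sky blue?"))
```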

+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.24-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.24-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```

+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: gpt-oss-20b # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: gpt-oss-20b # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/gpt-oss-20b:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```

+ ```bash
+ # make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>

+ # Apply both resources to run the model:

+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml

+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```

+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.

+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "gpt-oss-20b",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```

+ See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
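Because the curl request above sets `"stream": true`, the response arrives as server-sent events: one `data:` line per JSON chunk, terminated by `data: [DONE]`. A small parser sketch for such a stream (function names are hypothetical; assumes the OpenAI-style chunk format that vLLM emits):

```python
import json


def parse_sse_chunks(raw: str):
    """Yield parsed JSON payloads from an OpenAI-style SSE stream."""
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel marking end of stream
        yield json.loads(data)


def collect_text(raw: str) -> str:
    """Concatenate the delta content of every streamed chunk."""
    parts = []
    for chunk in parse_sse_chunks(raw):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
    return "".join(parts)


# Example with a canned two-chunk stream:
raw = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n'
    "data: [DONE]\n"
)
print(collect_text(raw))  # -> Hello
```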
  ## PyTorch / Triton

  To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).