Files changed (1)
  1. README.md +181 -1
README.md CHANGED
@@ -10,7 +10,14 @@ tags:
  library_name: transformers
  ---

- # Qwen2.5-7B-Instruct
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Qwen2.5-7B-Instruct
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
  <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
  <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
  </a>
@@ -84,6 +91,179 @@ generated_ids = [

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  ```
+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/Qwen2.5-7B-Instruct"
+ number_gpus = 4
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ # Format the request with the model's chat template before generating
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+ outputs = llm.generate(prompt, sampling_params)
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
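+ As a minimal sketch of that path (assuming a server started with `vllm serve RedHatAI/Qwen2.5-7B-Instruct` on its default port 8000, and the official `openai` Python client):
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the local vLLM server; the key is a placeholder,
+ # since no API key is configured in this setup
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="RedHatAI/Qwen2.5-7B-Instruct",
+     messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
+     temperature=0.7,
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```
+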
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Qwen2.5-7B-Instruct
+ ```
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/qwen2-5-7b-instruct:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/qwen2-5-7b-instruct
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/qwen2-5-7b-instruct
+ ```
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: qwen-inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Qwen2.5-7B-Instruct # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: qwen2-5-7b-instruct # specify model name (must be a lowercase RFC 1123 name). This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-qwen2-5-7b-instruct:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f qwen-inferenceservice.yaml
+ ```
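+
+ Before calling the endpoint, you can optionally confirm the service is up (a sketch; the resource name is assumed to match the InferenceService manifest above):
+
+ ```bash
+ # Wait for the InferenceService to report Ready, then print its URL
+ oc wait inferenceservice/qwen2-5-7b-instruct --for=condition=Ready --timeout=10m
+ oc get inferenceservice qwen2-5-7b-instruct -o jsonpath='{.status.url}'
+ ```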
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "qwen2-5-7b-instruct",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+

  ### Processing Long Texts