Files changed (1)
  1. README.md +191 -7
README.md CHANGED
@@ -53,27 +53,211 @@ You are an expert at translating text from English to Simplified Chinese.</s>
  What is the Simplified Chinese translation of the sentence: The GRACE mission is a collaboration between the NASA and German Aerospace Center.?</s>
  <s>Assistant
  ```
- ## Usage
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
 
  tokenizer = AutoTokenizer.from_pretrained("nvidia/Riva-Translate-4B-Instruct-v1.1")
  model = AutoModelForCausalLM.from_pretrained("nvidia/Riva-Translate-4B-Instruct-v1.1").cuda()
 
-
- # Use the prompt template
  messages = [
      {
          "role": "system",
-         "content": "You are an expert at translating text from English to Simplified Chinese.",
      },
-     {"role": "user", "content": "What is the Simplified Chinese translation of the sentence: The GRACE mission is a collaboration between the NASA and German Aerospace Center.?"},
  ]
  tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
  outputs = model.generate(tokenized_chat, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
  print(tokenizer.decode(outputs[0]))
  ```
+ ## Quick Start Guide
+
+ ### How to Choose the Language Pair
+
+ To select a language pair for translation, include one of the following tags in the system prompt:
+ * `en-zh-cn` or `en-zh`: English to Simplified Chinese
+ * `en-zh-tw`: English to Traditional Chinese
+ * `en-ar`: English to Arabic
+ * `en-de`: English to German
+ * `en-es` or `en-es-es`: English to European Spanish
+ * `en-es-us`: English to Latin American Spanish
+ * `en-fr`: English to French
+ * `en-ja`: English to Japanese
+ * `en-ko`: English to Korean
+ * `en-ru`: English to Russian
+ * `en-pt` or `en-pt-br`: English to Brazilian Portuguese
+ * `zh-en` or `zh-cn-en`: Simplified Chinese to English
+ * `zh-tw-en`: Traditional Chinese to English
+ * `ar-en`: Arabic to English
+ * `de-en`: German to English
+ * `es-en` or `es-es-en`: European Spanish to English
+ * `es-us-en`: Latin American Spanish to English
+ * `fr-en`: French to English
+ * `ja-en`: Japanese to English
+ * `ko-en`: Korean to English
+ * `ru-en`: Russian to English
+ * `pt-en` or `pt-br-en`: Brazilian Portuguese to English
+
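For programmatic use, the tag list above can be mirrored in a small lookup helper. This is an illustrative sketch, not part of the model's API; the authoritative mapping is the `language_pairs` dict embedded in the model's chat template (shown later in this card).

```python
# Illustrative tag -> (source, target) lookup mirroring the supported pairs.
# Aliases (e.g. "en-zh" / "en-zh-cn") resolve to the same language pair.
_EN_TARGETS = {
    "zh-cn": "Simplified Chinese", "zh": "Simplified Chinese",
    "zh-tw": "Traditional Chinese", "ar": "Arabic", "de": "German",
    "es": "European Spanish", "es-es": "European Spanish",
    "es-us": "Latin American Spanish", "fr": "French", "ja": "Japanese",
    "ko": "Korean", "ru": "Russian",
    "pt": "Brazilian Portuguese", "pt-br": "Brazilian Portuguese",
}

def resolve_language_pair(tag: str) -> tuple[str, str]:
    """Return (source_language, target_language) for a system-prompt tag."""
    tag = tag.strip().lower()
    if tag.startswith("en-"):
        rest = tag[3:]  # e.g. "en-es-us" -> "es-us"
        if rest in _EN_TARGETS:
            return ("English", _EN_TARGETS[rest])
    elif tag.endswith("-en"):
        rest = tag[:-3]  # e.g. "pt-br-en" -> "pt-br"
        if rest in _EN_TARGETS:
            return (_EN_TARGETS[rest], "English")
    raise ValueError(f"Unsupported language pair tag: {tag}")
```

A helper like this is useful for validating user-supplied tags before they reach the model, since an unrecognized tag silently falls back to a generic translation prompt.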
+ ### Use it with Transformers
+
+ ```
+ from transformers import AutoTokenizer, AutoModelForCausalLM
 
  tokenizer = AutoTokenizer.from_pretrained("nvidia/Riva-Translate-4B-Instruct-v1.1")
  model = AutoModelForCausalLM.from_pretrained("nvidia/Riva-Translate-4B-Instruct-v1.1").cuda()
 
+ # Use the prompt template (along with chat template)
  messages = [
      {
          "role": "system",
+         "content": "en-zh",
      },
+     {"role": "user", "content": "The GRACE mission is a collaboration between the NASA and German Aerospace Center.?"},
  ]
  tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
  outputs = model.generate(tokenized_chat, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
  print(tokenizer.decode(outputs[0]))
  ```
+
+ ### Use it with vLLM
+ To install vLLM, use the following pip command in a terminal within a supported environment.
+
+ ```
+ pip install -U "vllm>=0.12.0"
+ ```
+
+ Launch a vLLM server using the following command. In this example, we use a context length of 8K tokens, as supported by the model.
+
+ ```
+ python3 -m vllm.entrypoints.openai.api_server \
+     --model nvidia/Riva-Translate-4B-Instruct-v1.1 \
+     --dtype bfloat16 \
+     --gpu-memory-utilization 0.95 \
+     --max-model-len 8192 \
+     --host 0.0.0.0 \
+     --port 8000 \
+     --tensor-parallel-size 1 \
+     --served-model-name Riva-Translate-4B-Instruct-v1.1
+ ```
+
+ Alternatively, you can use Docker to launch a vLLM server.
+
+ ```
+ docker run --runtime nvidia --gpus all \
+     -v ~/.cache/huggingface:/root/.cache/huggingface \
+     -p 8000:8000 \
+     --ipc=host \
+     vllm/vllm-openai:v0.12.0 \
+     --model nvidia/Riva-Translate-4B-Instruct-v1.1 \
+     --dtype bfloat16 \
+     --gpu-memory-utilization 0.95 \
+     --max-model-len 8192 \
+     --host 0.0.0.0 \
+     --port 8000 \
+     --tensor-parallel-size 1 \
+     --served-model-name Riva-Translate-4B-Instruct-v1.1
+ ```
+
+ If you are using DGX Spark or Jetson Thor, please use [this vLLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3). On Jetson Thor, be sure to include `--runtime nvidia` when running the Docker container.
+
+ ```
+ # On DGX Spark or Jetson Thor (omit --runtime nvidia on DGX Spark)
+ docker run \
+     --runtime nvidia \
+     -v ~/.cache/huggingface:/root/.cache/huggingface \
+     -p 8000:8000 \
+     --ipc=host \
+     nvcr.io/nvidia/vllm:25.12.post1-py3 \
+     vllm serve nvidia/Riva-Translate-4B-Instruct-v1.1 \
+     --dtype bfloat16 \
+     --gpu-memory-utilization 0.95 \
+     --max-model-len 8192 \
+     --host 0.0.0.0 \
+     --port 8000 \
+     --tensor-parallel-size 1 \
+     --served-model-name Riva-Translate-4B-Instruct-v1.1
+ ```
+
+ On Jetson Thor, the previous vLLM cache is not currently cleaned automatically, so it must be cleared manually. Always run this command on the host before serving any model on Jetson Thor.
+
+ ```
+ sudo sysctl -w vm.drop_caches=3
+ ```
+
+ Here is an example client request for vLLM, using curl.
+
+ ```
+ curl http://localhost:8000/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+         "model": "Riva-Translate-4B-Instruct-v1.1",
+         "messages": [
+             {"role": "system", "content": "en-zh"},
+             {"role": "user", "content": "The GRACE mission is a collaboration between the NASA and German Aerospace Center.?"}
+         ]
+     }'
+ ```
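The same request can be sent from Python with only the standard library. The helper below is an illustrative sketch, not an official client: `build_translation_request` assembles the OpenAI-style chat payload shown in the curl example, and `translate` posts it, assuming the server from the previous section is running on `localhost:8000`.

```python
import json
import urllib.request

def build_translation_request(tag: str, text: str) -> dict:
    """Build an OpenAI-style chat payload for the vLLM server above."""
    return {
        "model": "Riva-Translate-4B-Instruct-v1.1",
        "messages": [
            {"role": "system", "content": tag},   # language pair tag, e.g. "en-zh"
            {"role": "user", "content": text},    # sentence to translate
        ],
    }

def translate(tag: str, text: str,
              url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """Send the payload to a running vLLM server and return the translation."""
    payload = json.dumps(build_translation_request(tag, text)).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Note that the `model` field must match the `--served-model-name` passed when launching the server.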
+
+ ### Chat Template Structure
+ ```
+ {%- set language_pairs = {
+     'en-zh-cn': {'source': 'English', 'target': 'Simplified Chinese'},
+     'en-zh': {'source': 'English', 'target': 'Simplified Chinese'},
+     'en-zh-tw': {'source': 'English', 'target': 'Traditional Chinese'},
+     'en-ar': {'source': 'English', 'target': 'Arabic'},
+     'en-de': {'source': 'English', 'target': 'German'},
+     'en-es': {'source': 'English', 'target': 'European Spanish'},
+     'en-es-es': {'source': 'English', 'target': 'European Spanish'},
+     'en-es-us': {'source': 'English', 'target': 'Latin American Spanish'},
+     'en-fr': {'source': 'English', 'target': 'French'},
+     'en-ja': {'source': 'English', 'target': 'Japanese'},
+     'en-ko': {'source': 'English', 'target': 'Korean'},
+     'en-ru': {'source': 'English', 'target': 'Russian'},
+     'en-pt': {'source': 'English', 'target': 'Brazilian Portuguese'},
+     'en-pt-br': {'source': 'English', 'target': 'Brazilian Portuguese'},
+     'zh-en': {'source': 'Simplified Chinese', 'target': 'English'},
+     'zh-cn-en': {'source': 'Simplified Chinese', 'target': 'English'},
+     'zh-tw-en': {'source': 'Traditional Chinese', 'target': 'English'},
+     'ar-en': {'source': 'Arabic', 'target': 'English'},
+     'de-en': {'source': 'German', 'target': 'English'},
+     'es-en': {'source': 'European Spanish', 'target': 'English'},
+     'es-es-en': {'source': 'European Spanish', 'target': 'English'},
+     'es-us-en': {'source': 'Latin American Spanish', 'target': 'English'},
+     'fr-en': {'source': 'French', 'target': 'English'},
+     'ja-en': {'source': 'Japanese', 'target': 'English'},
+     'ko-en': {'source': 'Korean', 'target': 'English'},
+     'ru-en': {'source': 'Russian', 'target': 'English'},
+     'pt-en': {'source': 'Brazilian Portuguese', 'target': 'English'},
+     'pt-br-en': {'source': 'Brazilian Portuguese', 'target': 'English'},
+ } -%}
+
+ {%- set system_message = '' -%}
+ {%- set source_lang = '' -%}
+ {%- set target_lang = '' -%}
+
+ {%- if messages[0]['role'] == 'system' -%}
+     {%- set lang_pair = messages[0]['content'] | trim -%}
+     {%- set messages = messages[1:] -%}
+     {%- if lang_pair in language_pairs -%}
+         {%- set source_lang = language_pairs[lang_pair]['source'] -%}
+         {%- set target_lang = language_pairs[lang_pair]['target'] -%}
+         {%- set system_message = 'You are an expert at translating text from ' + source_lang + ' to ' + target_lang + '.' -%}
+     {%- else -%}
+         {%- set system_message = 'You are a translation expert.' -%}
+     {%- endif -%}
+ {%- endif -%}
+
+ {{- '<s>System\n' + system_message + '</s>\n' -}}
+
+ {%- for message in messages -%}
+     {%- if (message['role'] in ['user']) != (loop.index0 % 2 == 0) -%}
+         {{- raise_exception('Conversation roles must alternate between user and assistant') -}}
+     {%- elif message['role'] == 'user' -%}
+         {%- set user_content = (
+             target_lang
+             and 'What is the ' + target_lang + ' translation of the sentence: ' + message['content'] | trim
+             or message['content'] | trim
+         ) -%}
+         {{- '<s>User\n' + user_content + '</s>\n' -}}
+     {%- elif message['role'] == 'assistant' -%}
+         {{- '<s>Assistant\n' + message['content'] | trim + '</s>\n' -}}
+     {%- endif -%}
+ {%- endfor -%}
+
+ {%- if add_generation_prompt -%}
+     {{ '<s>Assistant\n' }}
+ {%- endif -%}
+ ```
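To make the template's behavior concrete, here is a plain-Python rendering of the same logic (an illustrative sketch with a trimmed-down `language_pairs` table and without the role-alternation check; the Jinja template above is authoritative):

```python
def render_prompt(messages, add_generation_prompt=True):
    """Mirror the chat template above for a [system, user, ...] message list."""
    # Subset of the template's language_pairs mapping, for illustration only.
    language_pairs = {
        "en-zh": ("English", "Simplified Chinese"),
        "en-de": ("English", "German"),
        "ja-en": ("Japanese", "English"),
    }
    system_message = ""
    target_lang = ""
    if messages and messages[0]["role"] == "system":
        tag = messages[0]["content"].strip()
        messages = messages[1:]
        if tag in language_pairs:
            source_lang, target_lang = language_pairs[tag]
            system_message = (
                f"You are an expert at translating text from {source_lang} to {target_lang}."
            )
        else:
            # Unknown tag: the template falls back to a generic system prompt.
            system_message = "You are a translation expert."
    out = f"<s>System\n{system_message}</s>\n"
    for msg in messages:
        if msg["role"] == "user":
            content = msg["content"].strip()
            if target_lang:
                content = f"What is the {target_lang} translation of the sentence: {content}"
            out += f"<s>User\n{content}</s>\n"
        elif msg["role"] == "assistant":
            out += f"<s>Assistant\n{msg['content'].strip()}</s>\n"
    if add_generation_prompt:
        out += "<s>Assistant\n"
    return out
```

The key behavior to note: the system turn carries only the language-pair tag, and the template itself expands it into both the system prompt and the "What is the ... translation" wrapper around the user sentence.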
+
+ ## Inference
+ * Engines: Hugging Face Transformers, vLLM
+ * Test Hardware: NVIDIA A100, H100 80GB, Jetson Thor, DGX Spark
+
  ## Ethical Considerations:
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.