---
language:
- ces
- slk
- cs
- sk
- en
tags:
- translation
- opus-mt-tc
license: cc-by-4.0
model-index:
- name: opus-mt-tc-big-en-ces_slk
  results:
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: flores101-devtest
      type: flores_101
      args: eng ces devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 34.1
  - task:
      name: Translation eng-slk
      type: translation
      args: eng-slk
    dataset:
      name: flores101-devtest
      type: flores_101
      args: eng slk devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 35.9
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: multi30k_test_2016_flickr
      type: multi30k-2016_flickr
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 33.4
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: multi30k_test_2018_flickr
      type: multi30k-2018_flickr
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 33.4
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: news-test2008
      type: news-test2008
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 22.8
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: tatoeba-test-v2021-08-07
      type: tatoeba_mt
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 47.5
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2009
      type: wmt-2009-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 24.3
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2010
      type: wmt-2010-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 24.4
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2011
      type: wmt-2011-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 25.5
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2012
      type: wmt-2012-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 22.6
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2013
      type: wmt-2013-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 27.4
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2014
      type: wmt-2014-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 31.4
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2015
      type: wmt-2015-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 27.0
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2016
      type: wmt-2016-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 29.9
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2017
      type: wmt-2017-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 24.9
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2018
      type: wmt-2018-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 24.6
  - task:
      name: Translation eng-ces
      type: translation
      args: eng-ces
    dataset:
      name: newstest2019
      type: wmt-2019-news
      args: eng-ces
    metrics:
    - name: BLEU
      type: bleu
      value: 26.4
---

# opus-mt-tc-big-en-ces_slk

Neural machine translation model for translating from English (en) to Czech and Slovak (ces+slk).

This model is part of the [OPUS-MT project](https://github.com/Helsinki-NLP/Opus-MT), an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with [Marian NMT](https://marian-nmt.github.io/), an efficient NMT implementation written in pure C++, and have been converted to PyTorch using Hugging Face's transformers library. Training data is taken from [OPUS](https://opus.nlpl.eu/) and the training pipelines follow the procedures of [OPUS-MT-train](https://github.com/Helsinki-NLP/Opus-MT-train).

* Publications: [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/) and [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/) (please cite if you use this model)

```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}
```

## Model info

* Release: 2022-03-13
* source language(s): eng
* target language(s): ces slk
* model: transformer-big
* data: opusTCv20210807+bt ([source](https://github.com/Helsinki-NLP/Tatoeba-Challenge))
* tokenization: SentencePiece (spm32k,spm32k)
* original model: [opusTCv20210807+bt_transformer-big_2022-03-13.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ces+slk/opusTCv20210807+bt_transformer-big_2022-03-13.zip)
* more information on released models: [OPUS-MT eng-ces+slk README](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-ces+slk/README.md)

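Because this is a multi-target model (one checkpoint serving both Czech and Slovak), every source sentence must start with a target-language token, `>>ces<<` or `>>slk<<`, before tokenization. A minimal sketch of that convention; the `with_target` helper is illustrative and not part of the transformers API:

```python
def with_target(lang: str, text: str) -> str:
    """Prefix text with an OPUS-MT target-language token such as >>ces<< or >>slk<<."""
    return f">>{lang}<< {text}"

# Select Slovak instead of Czech for the same English input:
print(with_target("slk", "We were enemies."))
# → >>slk<< We were enemies.
```

Sentences without a target token may still translate, but the output language is then left to the model's discretion, so the token should always be set explicitly.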
## Usage

A short example code:

```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>ces<< We were enemies.",
    ">>ces<< Do you think Tom knows what's going on?"
]

model_name = "Helsinki-NLP/opus-mt-tc-big-en-ces_slk"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#   Byli jsme nepřátelé.
#   Myslíš, že Tom ví, co se děje?
```

You can also use OPUS-MT models with the transformers pipelines, for example:

```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-ces_slk")
print(pipe(">>ces<< We were enemies."))

# expected output: [{'translation_text': 'Byli jsme nepřátelé.'}]
```

## Benchmarks

* test set translations: [opusTCv20210807+bt_transformer-big_2022-03-13.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ces+slk/opusTCv20210807+bt_transformer-big_2022-03-13.test.txt)
* test set scores: [opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ces+slk/opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt)
* benchmark results: [benchmark_results.txt](benchmark_results.txt)
* benchmark output: [benchmark_translations.zip](benchmark_translations.zip)

| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| eng-ces | tatoeba-test-v2021-08-07 | 0.66128 | 47.5 | 13824 | 91332 |
| eng-ces | flores101-devtest | 0.60411 | 34.1 | 1012 | 22101 |
| eng-slk | flores101-devtest | 0.62415 | 35.9 | 1012 | 22543 |
| eng-ces | multi30k_test_2016_flickr | 0.58547 | 33.4 | 1000 | 10503 |
| eng-ces | multi30k_test_2018_flickr | 0.59236 | 33.4 | 1071 | 11631 |
| eng-ces | newssyscomb2009 | 0.52702 | 25.3 | 502 | 10032 |
| eng-ces | news-test2008 | 0.50286 | 22.8 | 2051 | 42484 |
| eng-ces | newstest2009 | 0.52152 | 24.3 | 2525 | 55533 |
| eng-ces | newstest2010 | 0.52527 | 24.4 | 2489 | 52955 |
| eng-ces | newstest2011 | 0.52721 | 25.5 | 3003 | 65653 |
| eng-ces | newstest2012 | 0.50007 | 22.6 | 3003 | 65456 |
| eng-ces | newstest2013 | 0.53643 | 27.4 | 3000 | 57250 |
| eng-ces | newstest2014 | 0.58944 | 31.4 | 3003 | 59902 |
| eng-ces | newstest2015 | 0.55094 | 27.0 | 2656 | 45858 |
| eng-ces | newstest2016 | 0.56864 | 29.9 | 2999 | 56998 |
| eng-ces | newstest2017 | 0.52504 | 24.9 | 3005 | 54361 |
| eng-ces | newstest2018 | 0.52490 | 24.6 | 2983 | 54652 |
| eng-ces | newstest2019 | 0.53994 | 26.4 | 1997 | 43113 |

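The chr-F column reports the character n-gram F-score (chrF), which complements BLEU by scoring character rather than word overlap and is therefore more forgiving of the rich inflection in Czech and Slovak. The official scores come from the OPUS-MT evaluation pipeline; the following is only a simplified stdlib sketch of the metric's core idea (uniform average of character n-gram precision and recall for n = 1..6, combined with β = 2), not a substitute for the reference implementation:

```python
from collections import Counter

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Toy chrF: macro-averaged char n-gram precision/recall, F-beta combined.

    Simplification: whitespace is stripped entirely; the reference
    implementation handles whitespace and word n-grams more carefully.
    """
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

For identical hypothesis and reference the score is 1.0, and it degrades gradually with character-level differences, which is the behavior the chr-F column above summarizes per test set.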
## Acknowledgements

The work is supported by the [European Language Grid](https://www.european-language-grid.eu/) as [pilot project 2866](https://live.european-language-grid.eu/catalogue/#/resource/projects/2866), by the [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and by the [MeMAD project](https://memad.eu/), funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by [CSC – IT Center for Science](https://www.csc.fi/), Finland.

## Model conversion info

* transformers version: 4.16.2
* OPUS-MT git hash: 3405783
* port time: Wed Apr 13 16:46:48 EEST 2022
* port machine: LM0-400-22516.local