Add paper link and abstract to model card

#1
by nielsr (HF Staff) · opened
Files changed (1)
  1. README.md +6 -156
README.md CHANGED
@@ -1,11 +1,11 @@
  ---
- license: cc-by-4.0
  language:
  - cs
  - pl
  - sk
  - sl
  library_name: transformers
+ license: cc-by-4.0
  tags:
  - translation
  - mt
@@ -25,6 +25,10 @@ tags:
  <a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
  </p>

+ This repository contains the model described in the paper [MultiSlav: Multilingual Translation of Slavic Languages with Pivoting and Cross-lingual Data](https://hf.co/papers/2502.14509).
+
+
+
  ## Multilingual Polish-to-Many MT Model

  ___P4-pol2many___ is a vanilla encoder-decoder transformer model trained on a sentence-level Machine Translation task.
@@ -138,158 +142,4 @@ During the training we used the [MarianNMT](https://marian-nmt.github.io/) framework
  Base Marian configuration used: [transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
  All training parameters are listed in the table below.

- ### Training hyperparameters:
-
- | **Hyperparameter**         | **Value**                                                                                                   |
- |----------------------------|-------------------------------------------------------------------------------------------------------------|
- | Total Parameter Size       | 242M                                                                                                        |
- | Training Examples          | 112M                                                                                                        |
- | Vocab Size                 | 64k                                                                                                         |
- | Base Parameters            | [Marian transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113)  |
- | Number of Encoding Layers  | 6                                                                                                           |
- | Number of Decoding Layers  | 6                                                                                                           |
- | Model Dimension            | 1024                                                                                                        |
- | FF Dimension               | 4096                                                                                                        |
- | Heads                      | 16                                                                                                          |
- | Dropout                    | 0.1                                                                                                         |
- | Batch Size                 | mini batch fit to VRAM                                                                                      |
- | Training Accelerators      | 4x A100 40GB                                                                                                |
- | Max Length                 | 100 tokens                                                                                                  |
- | Optimizer                  | Adam                                                                                                        |
- | Warmup steps               | 8000                                                                                                        |
- | Context                    | Sentence-level MT                                                                                           |
- | Source Language Supported  | Polish                                                                                                      |
- | Target Languages Supported | Czech, Slovak, Slovene                                                                                      |
- | Precision                  | float16                                                                                                     |
- | Validation Freq            | 3000 steps                                                                                                  |
- | Stop Metric                | ChrF                                                                                                        |
- | Stop Criterion             | 20 Validation steps                                                                                         |
-
-
- ## Training corpora
-
- <p align="center">
- <img src="pivot-data-pol2many.svg">
- </p>
-
- Our main research question was: "How does adding additional, related languages affect the quality of the model?" We explored it within the Slavic language family.
- In this model we experimented with expanding the data regime by using data from multiple target languages.
- We found that the additional target data clearly improved performance compared to the bi-directional baseline models.
- For example, in translation from Polish to Czech this allowed us to expand the training data from 63M to 112M examples, and from 23M to 112M examples for Polish-to-Slovene translation.
- We used only explicitly open-source data to ensure the open-source license of our model.
-
- Datasets were downloaded via the [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library. Total number of examples after filtering and deduplication: __112M__.
-
- The datasets used:
-
- | **Corpus**           |
- |----------------------|
- | paracrawl            |
- | opensubtitles        |
- | multiparacrawl       |
- | dgt                  |
- | elrc                 |
- | xlent                |
- | wikititles           |
- | wmt                  |
- | wikimatrix           |
- | dcep                 |
- | ELRC                 |
- | tildemodel           |
- | europarl             |
- | eesc                 |
- | eubookshop           |
- | emea                 |
- | jrc_acquis           |
- | ema                  |
- | qed                  |
- | elitr_eca            |
- | EU-dcep              |
- | rapid                |
- | ecb                  |
- | kde4                 |
- | news_commentary      |
- | kde                  |
- | bible_uedin          |
- | europat              |
- | elra                 |
- | wikipedia            |
- | wikimedia            |
- | tatoeba              |
- | globalvoices         |
- | euconst              |
- | ubuntu               |
- | php                  |
- | ecdc                 |
- | eac                  |
- | eac_reference        |
- | gnome                |
- | EU-eac               |
- | books                |
- | EU-ecdc              |
- | newsdev              |
- | khresmoi_summary     |
- | czechtourism         |
- | khresmoi_summary_dev |
- | worldbank            |
-
- ## Evaluation
-
- Evaluation of the models was performed on the [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
- The table below compares the performance of open-source models and all applicable models from our collection.
- Metrics: BLEU, ChrF2, and Unbabel/wmt22-comet-da.
-
- Translation results for Polish to Czech (the Slavic direction with the __highest__ data-regime):
-
- | **Model**                                       | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100                                         | 89.6        | 19.8     | 47.7     | 1.2B           |
- | NLLB−200                                        | 89.4        | 19.2     | 46.7     | 1.3B           |
- | Opus Sla-Sla                                    | 82.9        | 14.6     | 42.6     | 64M            |
- | BiDi-ces-pol (baseline)                         | 90.0        | 20.3     | 48.5     | 209M           |
- | P4-pol2many <span style="color:green;">*</span> | 90.2        | 20.2     | 48.5     | 242M           |
- | P5-eng <span style="color:red;">◊</span>        | 89.0        | 19.9     | 48.3     | 2x 258M        |
- | P5-ces <span style="color:red;">◊</span>        | 90.3        | 20.2     | 48.6     | 2x 258M        |
- | MultiSlav-4slav                                 | 90.2        | 20.6     | 48.7     | 242M           |
- | ___MultiSlav-5lang___                           | __90.4__    | __20.7__ | __48.9__ | 258M           |
-
- Translation results for Polish to Slovene (the direction from Polish with the __lowest__ data-regime):
-
- | **Model**                                       | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100                                         | 89.6        | 26.6     | 55.0     | 1.2B           |
- | NLLB−200                                        | 88.8        | 23.3     | 42.0     | 1.3B           |
- | BiDi-pol-slv (baseline)                         | 89.4        | 26.6     | 55.4     | 209M           |
- | P4-pol2many <span style="color:green;">*</span> | 88.4        | 24.8     | 53.2     | 242M           |
- | P5-eng <span style="color:red;">◊</span>        | 88.5        | 25.6     | 54.6     | 2x 258M        |
- | P5-ces <span style="color:red;">◊</span>        | 89.8        | 26.6     | 55.3     | 2x 258M        |
- | MultiSlav-4slav                                 | 90.1        | __27.1__ | __55.7__ | 242M           |
- | ___MultiSlav-5lang___                           | __90.2__    | __27.1__ | __55.7__ | 258M           |
-
-
- <span style="color:green;">*</span> this model
-
- <span style="color:red;">◊</span> a system of 2 models: *Many2XXX* and *XXX2Many*
-
- ## Limitations and Biases
-
- We did not evaluate the inherent bias contained in the training datasets. We advise validating our models for bias in your target domain. This might be especially problematic in translation from English to Slavic languages, which require explicitly indicated gender; the model might hallucinate it based on bias present in the training data.
-
- ## License
-
- The model is licensed under CC BY 4.0, which allows for commercial use.
-
- ## Citation
- TO BE UPDATED SOON 🤗
-
-
- ## Contact Options
-
- Authors:
- - MLR @ Allegro: [Artur Kot](https://linkedin.com/in/arturkot), [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski), [Wojciech Chojnowski](https://linkedin.com/in/wojciech-chojnowski-744702348), [Mieszko Rutkowski](https://linkedin.com/in/mieszko-rutkowski)
- - Laniqo.com: [Artur Nowakowski](https://linkedin.com/in/artur-nowakowski-mt), [Kamil Guttmann](https://linkedin.com/in/kamil-guttmann), [Mikołaj Pokrywka](https://linkedin.com/in/mikolaj-pokrywka)
-
- Please don't hesitate to contact the authors if you have any questions or suggestions:
- - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
- - LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+ ### Training hyperparameters:
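A minimal inference sketch for the updated card, using the `transformers` pipeline API the card's metadata points to. The checkpoint id `allegro/p4-pol2many` and the Marian-style `>>lang<<` target-language prefix are illustrative assumptions, not confirmed by this diff; see the card's usage section for the exact convention.

```python
from transformers import pipeline

# Assumed repository id -- replace with the actual checkpoint id of this model.
model_id = "allegro/p4-pol2many"

translator = pipeline("translation", model=model_id)

# Marian-style multilingual models typically select the target language with a
# ">>lang<<" prefix token; that convention is assumed here, not taken from the card.
text = ">>ces<< Wszyscy ludzie rodzą się wolni i równi w swojej godności i prawach."
print(translator(text)[0]["translation_text"])
```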
 
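The evaluation section reports BLEU, ChrF2, and Unbabel/wmt22-comet-da scores on Flores200. The sketch below shows how such scores are commonly computed with the `sacrebleu` and `unbabel-comet` libraries; the sentences are placeholders and this is not the authors' evaluation code.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder data: source sentences, system outputs, and references.
sources = ["Wszyscy ludzie rodzą się wolni i równi."]
hypotheses = ["Všichni lidé se rodí svobodní a rovní."]
references = ["Všichni lidé se rodí svobodní a sobě rovní."]

# BLEU and chrF2 (sacrebleu's chrF defaults to beta=2, i.e. chrF2).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF2: {chrf.score:.1f}")

# COMET-22 (Unbabel/wmt22-comet-da) scores each (source, hypothesis, reference) triple.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET-22:", comet.predict(data, batch_size=8, gpus=0).system_score)
```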