Add paper link and abstract to model card

#1
by nielsr (HF Staff) · opened
Files changed (1)
  1. README.md +6 -156
README.md CHANGED
@@ -1,11 +1,11 @@
  ---
- license: cc-by-4.0
  language:
  - cs
  - pl
  - sk
  - sl
  library_name: transformers
+ license: cc-by-4.0
  tags:
  - translation
  - mt
@@ -25,6 +25,10 @@ tags:
  <a href="https://ml.allegro.tech/"><img src="allegro-title.svg" alt="MLR @ Allegro.com"></a>
  </p>

+ This repository contains the model described in the paper [MultiSlav: Multilingual Translation of Slavic Languages with Pivoting and Cross-lingual Data](https://hf.co/papers/2502.14509).
+
+
+
  ## Multilingual Polish-to-Many MT Model

  ___P4-pol2many___ is a vanilla encoder-decoder transformer model trained on a sentence-level Machine Translation task.
@@ -138,158 +142,4 @@ During the training we used the [MarianNMT](https://marian-nmt.github.io/) framework
  Base Marian configuration used: [transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
  All training parameters are listed in the table below.

- ### Training hyperparameters:
-
- | **Hyperparameter**         | **Value**                                                                                                   |
- |----------------------------|-------------------------------------------------------------------------------------------------------------|
- | Total Parameter Size       | 242M                                                                                                        |
- | Training Examples          | 112M                                                                                                        |
- | Vocab Size                 | 64k                                                                                                         |
- | Base Parameters            | [Marian transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113)  |
- | Number of Encoding Layers  | 6                                                                                                           |
- | Number of Decoding Layers  | 6                                                                                                           |
- | Model Dimension            | 1024                                                                                                        |
- | FF Dimension               | 4096                                                                                                        |
- | Heads                      | 16                                                                                                          |
- | Dropout                    | 0.1                                                                                                         |
- | Batch Size                 | mini batch fit to VRAM                                                                                      |
- | Training Accelerators      | 4x A100 40GB                                                                                                |
- | Max Length                 | 100 tokens                                                                                                  |
- | Optimizer                  | Adam                                                                                                        |
- | Warmup steps               | 8000                                                                                                        |
- | Context                    | Sentence-level MT                                                                                           |
- | Source Language Supported  | Polish                                                                                                      |
- | Target Languages Supported | Czech, Slovak, Slovene                                                                                      |
- | Precision                  | float16                                                                                                     |
- | Validation Freq            | 3000 steps                                                                                                  |
- | Stop Metric                | ChrF                                                                                                        |
- | Stop Criterion             | 20 Validation steps                                                                                         |
-
-
- ## Training corpora
-
- <p align="center">
- <img src="pivot-data-pol2many.svg">
- </p>
-
- Our main research question was: "How does adding additional, related languages affect the quality of the model?" We explored it within the Slavic language family.
- In this model we experimented with expanding the data regime by using data from multiple target languages.
- We found that the additional target data clearly improved performance compared to the bi-directional baseline models.
- For example, in translation from Polish to Czech this allowed us to expand the training data from 63M to 112M examples, and from 23M to 112M examples for Polish-to-Slovene translation.
- We used only explicitly open-source data to ensure the open-source license of our model.
-
- Datasets were downloaded via the [MT-Data](https://pypi.org/project/mtdata/0.2.10/) library. Total number of examples after filtering and deduplication: __112M__.
-
- The datasets used:
-
- | **Corpus**           |
- |----------------------|
- | paracrawl            |
- | opensubtitles        |
- | multiparacrawl       |
- | dgt                  |
- | elrc                 |
- | xlent                |
- | wikititles           |
- | wmt                  |
- | wikimatrix           |
- | dcep                 |
- | ELRC                 |
- | tildemodel           |
- | europarl             |
- | eesc                 |
- | eubookshop           |
- | emea                 |
- | jrc_acquis           |
- | ema                  |
- | qed                  |
- | elitr_eca            |
- | EU-dcep              |
- | rapid                |
- | ecb                  |
- | kde4                 |
- | news_commentary      |
- | kde                  |
- | bible_uedin          |
- | europat              |
- | elra                 |
- | wikipedia            |
- | wikimedia            |
- | tatoeba              |
- | globalvoices         |
- | euconst              |
- | ubuntu               |
- | php                  |
- | ecdc                 |
- | eac                  |
- | eac_reference        |
- | gnome                |
- | EU-eac               |
- | books                |
- | EU-ecdc              |
- | newsdev              |
- | khresmoi_summary     |
- | czechtourism         |
- | khresmoi_summary_dev |
- | worldbank            |
-
- ## Evaluation
-
- Evaluation of the models was performed on the [Flores200](https://huggingface.co/datasets/facebook/flores) dataset.
- The table below compares the performance of open-source models and all applicable models from our collection.
- Metrics: BLEU, ChrF2, and Unbabel/wmt22-comet-da.
-
- Translation results for Polish to Czech (the Slavic direction with the __highest__ data-regime):
-
- | **Model**                                       | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100                                         | 89.6        | 19.8     | 47.7     | 1.2B           |
- | NLLB−200                                        | 89.4        | 19.2     | 46.7     | 1.3B           |
- | Opus Sla-Sla                                    | 82.9        | 14.6     | 42.6     | 64M            |
- | BiDi-ces-pol (baseline)                         | 90.0        | 20.3     | 48.5     | 209M           |
- | P4-pol2many <span style="color:green;">*</span> | 90.2        | 20.2     | 48.5     | 242M           |
- | P5-eng <span style="color:red;">◊</span>        | 89.0        | 19.9     | 48.3     | 2x 258M        |
- | P5-ces <span style="color:red;">◊</span>        | 90.3        | 20.2     | 48.6     | 2x 258M        |
- | MultiSlav-4slav                                 | 90.2        | 20.6     | 48.7     | 242M           |
- | ___MultiSlav-5lang___                           | __90.4__    | __20.7__ | __48.9__ | 258M           |
-
- Translation results for Polish to Slovene (the direction from Polish with the __lowest__ data-regime):
-
- | **Model**                                       | **Comet22** | **BLEU** | **ChrF** | **Model Size** |
- |-------------------------------------------------|:-----------:|:--------:|:--------:|---------------:|
- | M2M−100                                         | 89.6        | 26.6     | 55.0     | 1.2B           |
- | NLLB−200                                        | 88.8        | 23.3     | 42.0     | 1.3B           |
- | BiDi-pol-slv (baseline)                         | 89.4        | 26.6     | 55.4     | 209M           |
- | P4-pol2many <span style="color:green;">*</span> | 88.4        | 24.8     | 53.2     | 242M           |
- | P5-eng <span style="color:red;">◊</span>        | 88.5        | 25.6     | 54.6     | 2x 258M        |
- | P5-ces <span style="color:red;">◊</span>        | 89.8        | 26.6     | 55.3     | 2x 258M        |
- | MultiSlav-4slav                                 | 90.1        | __27.1__ | __55.7__ | 242M           |
- | ___MultiSlav-5lang___                           | __90.2__    | __27.1__ | __55.7__ | 258M           |
-
-
- <span style="color:green;">*</span> this model
-
- <span style="color:red;">◊</span> a system of 2 models: *Many2XXX* and *XXX2Many*
-
- ## Limitations and Biases
-
- We did not evaluate the inherent bias contained in the training datasets. We advise validating our models for bias in your target domain. This might be especially problematic in translation from English to Slavic languages, which require explicitly indicated gender; the model might hallucinate it based on bias present in the training data.
-
- ## License
-
- The model is licensed under CC BY 4.0, which allows for commercial use.
-
- ## Citation
- TO BE UPDATED SOON 🤗
-
-
- ## Contact Options
-
- Authors:
- - MLR @ Allegro: [Artur Kot](https://linkedin.com/in/arturkot), [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski), [Wojciech Chojnowski](https://linkedin.com/in/wojciech-chojnowski-744702348), [Mieszko Rutkowski](https://linkedin.com/in/mieszko-rutkowski)
- - Laniqo.com: [Artur Nowakowski](https://linkedin.com/in/artur-nowakowski-mt), [Kamil Guttmann](https://linkedin.com/in/kamil-guttmann), [Mikołaj Pokrywka](https://linkedin.com/in/mikolaj-pokrywka)
-
- Please don't hesitate to contact the authors if you have any questions or suggestions:
- - e-mail: artur.kot@allegro.com or mikolaj.koszowski@allegro.com
- - LinkedIn: [Artur Kot](https://linkedin.com/in/arturkot) or [Mikołaj Koszowski](https://linkedin.com/in/mkoszowski)
+ ### Training hyperparameters:
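A minimal inference sketch for the updated card, using the `transformers` pipeline API the card's metadata points to. The checkpoint id `allegro/p4-pol2many` and the Marian-style `>>lang<<` target-language prefix are illustrative assumptions, not confirmed by this diff; see the card's usage section for the exact convention.

```python
from transformers import pipeline

# Assumed repository id -- replace with the actual checkpoint id of this model.
model_id = "allegro/p4-pol2many"

translator = pipeline("translation", model=model_id)

# Marian-style multilingual models typically select the target language with a
# ">>lang<<" prefix token; that convention is assumed here, not taken from the card.
text = ">>ces<< Wszyscy ludzie rodzą się wolni i równi w swojej godności i prawach."
print(translator(text)[0]["translation_text"])
```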
 
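The evaluation section reports BLEU, ChrF2, and Unbabel/wmt22-comet-da scores on Flores200. The sketch below shows how such scores are commonly computed with the `sacrebleu` and `unbabel-comet` libraries; the sentences are placeholders and this is not the authors' evaluation code.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder data: source sentences, system outputs, and references.
sources = ["Wszyscy ludzie rodzą się wolni i równi."]
hypotheses = ["Všichni lidé se rodí svobodní a rovní."]
references = ["Všichni lidé se rodí svobodní a sobě rovní."]

# BLEU and chrF2 (sacrebleu's chrF defaults to beta=2, i.e. chrF2).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF2: {chrf.score:.1f}")

# COMET-22 (Unbabel/wmt22-comet-da) scores each (source, hypothesis, reference) triple.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET-22:", comet.predict(data, batch_size=8, gpus=0).system_score)
```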