Update README.md
README.md
CHANGED
@@ -12,9 +12,9 @@ library_name: fairseq
 ## Model description
 
 This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Galician-Catalan datasets
-totalling
-
-The model was evaluated on the Flores,
+totalling approximately 75 million sentence pairs, comprising both Catalan-Galician data sourced from Opus and synthetic Galician-Catalan data created by the GL-ES translator of
+[Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl) on the Spanish side of the Projecte Aina Spanish-Catalan corpus.
+The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
@@ -50,10 +50,11 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The Catalan-Galician data is a combination of publicly available bilingual datasets collected from
-
-
-
+The Catalan-Galician data is a combination of publicly available bilingual datasets collected from [Opus](https://opus.nlpl.eu/) and synthetic data created by translating
+the Spanish side of the Projecte Aina Spanish-Catalan corpus using the GL-ES translator of
+[Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl).
+
+
 
 ### Training procedure
 
@@ -117,7 +118,6 @@ Below are the evaluation results on the machine translation from Galician to Cat
 | Test set |Google Translate|M2M100 1.2B| NLLB 1.3B | NLLB 3.3 | aina-translator-gl-ca |
 |----------------------|----|-------|-----------|------------------|---------------|
 |Flores 101 devtest |**36,4**|32,6| 22,3 | 34,3 | 32,4 |
-| TaCON |48,4|56,5|32,2 | 54,1 | **58,2** |
 | NTREX |**34,7**|34,0|20,4 | 34,2 | 33,7 |
 | Average |39,0|41,0| 25,0 | 40,9 | **41,4** |
 