Fairseq
Galician
Catalan
AudreyVM committed on
Commit
2417e26
verified
1 Parent(s): d9be129

Update README.md

  1. README.md +8 -8
README.md CHANGED
@@ -12,9 +12,9 @@ library_name: fairseq
  ## Model description
 
  This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Galician-Catalan datasets
- totalling 10.017.995 sentence pairs. 4.267.995 sentence pairs were parallel data collected from the web while the remaining 5.750.000 sentence pairs
- were parallel synthetic data created using the GL-ES translator of [Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl).
- The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.
 
  ## Intended uses and limitations
 
@@ -50,10 +50,11 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
  ### Training data
 
- The Catalan-Galician data is a combination of publicly available bilingual datasets collected from the web.
- These datasets were concatenated before filtering to avoid intra-dataset duplicates and the final size was 4.267.995.
- Additional 5.750.000 sentence pairs of synthetic parallel data were created from a random sampling
- of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
 
  ### Training procedure
 
@@ -117,7 +118,6 @@ Below are the evaluation results on the machine translation from Galician to Cat
  | Test set |Google Translate|M2M100 1.2B| NLLB 1.3B | NLLB 3.3 | aina-translator-gl-ca |
  |----------------------|----|-------|-----------|------------------|---------------|
  |Flores 101 devtest |**36,4**|32,6| 22,3 | 34,3 | 32,4 |
- | TaCON |48,4|56,5|32,2 | 54,1 | **58,2** |
  | NTREX |**34,7**|34,0|20,4 | 34,2 | 33,7 |
  | Average |39,0|41,0| 25,0 | 40,9 | **41,4** |
 
 
  ## Model description
 
  This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Galician-Catalan datasets
+ totalling approximately 75 million sentence pairs, comprising both Catalan-Galician data sourced from Opus and synthetic Galician-Catalan data created by the GL-ES translator of
+ [Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl) on the Spanish side of the Projecte Aina Spanish-Catalan corpus.
+ The model was evaluated on the Flores and NTREX evaluation datasets.
 
  ## Intended uses and limitations
 
 
 
  ### Training data
 
+ The Catalan-Galician data is a combination of publicly available bilingual datasets collected from [Opus](https://opus.nlpl.eu/) and synthetic data created by translating
+ the Spanish side of the Projecte Aina Spanish-Catalan corpus using the GL-ES translator of
+ [Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl).
+
+
 
  ### Training procedure
 
 
  | Test set |Google Translate|M2M100 1.2B| NLLB 1.3B | NLLB 3.3 | aina-translator-gl-ca |
  |----------------------|----|-------|-----------|------------------|---------------|
  |Flores 101 devtest |**36,4**|32,6| 22,3 | 34,3 | 32,4 |
  | NTREX |**34,7**|34,0|20,4 | 34,2 | 33,7 |
  | Average |39,0|41,0| 25,0 | 40,9 | **41,4** |