projecte-aina/aina-translator-gl-ca

AudreyVM committed on Nov 7, 2024

Commit 2417e26 · verified · 1 parent: d9be129

Update README.md

Files changed (1)

README.md +8 -8

@@ -12,9 +12,9 @@ library_name: fairseq
 ## Model description
 This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Galician-Catalan datasets
-totalling 10.017.995 sentence pairs. 4.267.995 sentence pairs were parallel data collected from the web while the remaining 5.750.000 sentence pairs
-were parallel synthetic data created using the GL-ES translator of [Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl).
-The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.
 ## Intended uses and limitations
@@ -50,10 +50,11 @@ However, we are well aware that our models may be biased. We intend to conduct r
 ### Training data
-The Catalan-Galician data is a combination of publicly available bilingual datasets collected from the web.
-These datasets were concatenated before filtering to avoid intra-dataset duplicates and the final size was 4.267.995.
-Additional 5.750.000 sentence pairs of synthetic parallel data were created from a random sampling
-of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es).
 ### Training procedure
@@ -117,7 +118,6 @@ Below are the evaluation results on the machine translation from Galician to Cat
 | Test set         	|Google Translate|M2M100 1.2B| NLLB 1.3B | NLLB 3.3 | aina-translator-gl-ca |
 |----------------------|----|-------|-----------|------------------|---------------|
 |Flores 101 devtest   	|**36,4**|32,6| 22,3   	| 34,3   	| 32,4     	|
-| TaCON                 |48,4|56,5|32,2  	    | 54,1      	| **58,2**     	|
 | NTREX                 |**34,7**|34,0|20,4    	| 34,2     	| 33,7     	|
 | Average           	|39,0|41,0| 25,0 	| 40,9     	    | **41,4**      	|

 ## Model description
 This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Galician-Catalan datasets
+totalling approximately 75 million sentence pairs, comprising both Catalan-Galician data sourced from Opus and synthetic Galician-Catalan data created by applying the GL-ES translator of
+[Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl) to the Spanish side of the Projecte Aina Spanish-Catalan corpus.
+The model was evaluated on the Flores and NTREX evaluation datasets.
 ## Intended uses and limitations
 ### Training data
+The Catalan-Galician data is a combination of publicly available bilingual datasets collected from [Opus](https://opus.nlpl.eu/) and synthetic data created by translating
+the Spanish side of the Projecte Aina Spanish-Catalan corpus using the GL-ES translator of
+[Proxecto Nós](https://huggingface.co/proxectonos/Nos_MT-OpenNMT-es-gl).
 ### Training procedure
 | Test set         	|Google Translate|M2M100 1.2B| NLLB 1.3B | NLLB 3.3 | aina-translator-gl-ca |
 |----------------------|----|-------|-----------|------------------|---------------|
 |Flores 101 devtest   	|**36,4**|32,6| 22,3   	| 34,3   	| 32,4     	|
 | NTREX                 |**34,7**|34,0|20,4    	| 34,2     	| 33,7     	|
 | Average           	|39,0|41,0| 25,0 	| 40,9     	    | **41,4**      	|
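
The Average row can be checked against the two test-set rows that remain after this change. The following is a minimal sketch in Python, assuming the averages are plain arithmetic means of the per-test-set scores (the README does not state how they were computed). The helper names `to_float` and `column_means` are hypothetical, and the scores are copied from the table with their decimal commas.

```python
from statistics import mean

# Scores copied from the rows kept in the updated table (decimal commas as in the README).
systems = [
    "Google Translate",
    "M2M100 1.2B",
    "NLLB 1.3B",
    "NLLB 3.3",
    "aina-translator-gl-ca",
]
rows = {
    "Flores 101 devtest": ["36,4", "32,6", "22,3", "34,3", "32,4"],
    "NTREX": ["34,7", "34,0", "20,4", "34,2", "33,7"],
}


def to_float(score: str) -> float:
    """Convert a decimal-comma string such as '36,4' to a float."""
    return float(score.replace(",", "."))


def column_means(table: dict) -> list:
    """Return the per-system mean over all test sets, rounded to two decimals."""
    columns = zip(*(map(to_float, scores) for scores in table.values()))
    return [round(mean(column), 2) for column in columns]


# Assumption: an unweighted mean over test sets; the README does not confirm this.
averages = dict(zip(systems, column_means(rows)))
for system, average in averages.items():
    print(f"{system}: {average:.2f}")
```

Under that assumption, the two-row means come out lower than the values in the Average row. For four of the five systems, the Average row instead matches the three-set mean that includes the removed TaCON row, so it may not have been recomputed in this commit.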