Update README.md

Commit 7525ca0 (verified) by fdelucaf · Parent: 99bcc05
Files changed (1): README.md (+45 −42)
 
---
license: apache-2.0
datasets:
- projecte-aina/CA-DE_Parallel_Corpus
language:
- de
- ca
metrics:
- bleu
library_name: fairseq
---
## Projecte Aina’s German-Catalan machine translation model

## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets which, after filtering and cleaning, comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations
 
 
## How to use

Translate a sentence using Python:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-de-ca", revision="main")

tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokenized = tokenizer.tokenize("Willkommen beim Projekt Aina")

# Load the CTranslate2 model downloaded above and translate the tokens.
translator = ctranslate2.Translator(model_dir)
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]['tokens']))
```
## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data
 
The model was trained on a combination of the following datasets:

| Dataset   | Sentence pairs | Sentence pairs after cleaning |
|-----------|----------------|-------------------------------|
| Tilde     | 3.434.091      | 3.434.091                     |
| **Total** | **7.427.843**  | **6.258.272**                 |

All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/). The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
### Training procedure

#### Data preparation

All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.258.272 sentence pairs. Before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
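The similarity-filtering step described above can be sketched as follows. This is a minimal illustration, not the project's actual script: `encode` is a placeholder for any function that returns L2-normalized sentence embeddings (e.g. LaBSE via the sentence-transformers library with `normalize_embeddings=True`).

```python
import numpy as np

def filter_pairs(pairs, encode, threshold=0.75):
    """Deduplicate (src, tgt) pairs, then keep only those whose source and
    target embeddings have cosine similarity >= threshold."""
    unique = list(dict.fromkeys(pairs))  # drop exact duplicate pairs, keep order
    src = np.asarray(encode([s for s, _ in unique]))
    tgt = np.asarray(encode([t for _, t in unique]))
    # For L2-normalized embeddings, cosine similarity is a row-wise dot product.
    sims = np.sum(src * tgt, axis=1)
    return [pair for pair, sim in zip(unique, sims) if sim >= threshold]
```

On the real corpus, `encode` would batch all sentences through LaBSE, and pairs below the 0.75 threshold would be discarded.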
 
86
 
#### Tokenization

All data is tokenized using [SentencePiece](https://github.com/google/sentencepiece), with a 50,000-token SentencePiece model learned from the combination of all filtered training data. This model is included.
 
#### Hyperparameters

The model was trained for a total of 29.000 updates.

### Variables and metrics
We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.

### Evaluation results

Below are the evaluation results for machine translation from German to Catalan, compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):

| Test set           | SoftCatalà | Google Translate | aina-translator-de-ca |
|--------------------|------------|------------------|-----------------------|
| Flores 101 dev     | 29,0       | **35,1**         | 29,8                  |
| Flores 101 devtest | 29,3       | **35,4**         | 30,1                  |
 
## Additional information

### Author
The Language Technologies Unit of the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
 
 
 
### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

</details>