Commit · efd7721
1 Parent(s): 326dbc8
Update README.md
README.md CHANGED
|
@@ -121,6 +121,29 @@ The model architecture and config are the same as [M2M-100](https://huggingface.
|
|
| 121 |
|
| 122 |
**Note**: SMALL100Tokenizer requires sentencepiece, so make sure to install it by ```pip install sentencepiece```
|
| 123 |
|
| 124 |
```
|
| 125 |
from transformers import M2M100ForConditionalGeneration
|
| 126 |
from tokenization_small100 import SMALL100Tokenizer
|
|
@@ -146,7 +169,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
|
| 146 |
# => "Life is like a box of chocolate."
|
| 147 |
```
|
| 148 |
|
| 149 |
-
|
| 150 |
|
| 151 |
# Languages Covered
|
| 152 |
|
|
@@ -156,10 +181,21 @@ Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bas
|
|
| 156 |
|
| 157 |
If you use this model for your research, please cite the following work:
|
| 158 |
```
|
| 159 |
-
@
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
}
|
| 165 |
```
|
|
|
|
| 121 |
|
| 122 |
**Note**: SMALL100Tokenizer requires sentencepiece, so make sure to install it by ```pip install sentencepiece```
|
| 123 |
|
| 124 |
+
# Supervised Training
|
| 125 |
+
|
| 126 |
+
SMaLL-100 is a sequence-to-sequence model for translation. During training, the source sequence is ```[tgt_lang_code] + src_tokens + [EOS]``` and the target sequence is ```tgt_tokens + [EOS]```. An example of a supervised training step is shown below:
|
| 127 |
+
|
| 128 |
+
```
|
| 129 |
+
from transformers import M2M100ForConditionalGeneration
|
| 130 |
+
from tokenization_small100 import SMALL100Tokenizer
|
| 131 |
+
|
| 132 |
+
model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
|
| 133 |
+
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")
|
| 134 |
+
|
| 135 |
+
src_text = "Life is like a box of chocolates."
|
| 136 |
+
tgt_text = "La vie est comme une boîte de chocolat."
|
| 137 |
+
|
| 138 |
+
model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
|
| 139 |
+
|
| 140 |
+
loss = model(**model_inputs).loss # forward pass
|
| 141 |
+
```
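The source and target layout described above can be made concrete with a small sketch that does not use the tokenizer. It only illustrates where the target language code and the end-of-sequence marker go. The helper name `build_training_pair`, the `__fr__` spelling of the language token, and the `</s>` marker are assumptions made for illustration; the real tokenizer defines its own special tokens and works on subword IDs rather than whole words.

```python
from typing import List, Tuple

EOS = "</s>"  # assumed end-of-sequence marker, for illustration only


def build_training_pair(
    src_tokens: List[str], tgt_tokens: List[str], tgt_lang: str
) -> Tuple[List[str], List[str]]:
    """Lay out one (source, target) pair in the format described above."""
    # Hypothetical spelling of the language token; the real tokenizer
    # may represent the target language code differently.
    tgt_lang_code = f"__{tgt_lang}__"
    # source: [tgt_lang_code] + src_tokens + [EOS]
    source = [tgt_lang_code] + src_tokens + [EOS]
    # target: tgt_tokens + [EOS]
    target = tgt_tokens + [EOS]
    return source, target


source, target = build_training_pair(
    ["Life", "is", "good", "."], ["La", "vie", "est", "belle", "."], "fr"
)
print(source)  # ['__fr__', 'Life', 'is', 'good', '.', '</s>']
print(target)  # ['La', 'vie', 'est', 'belle', '.', '</s>']
```

Note that the target language code is prepended to the source side, so the encoder input alone tells the model which language to produce.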
|
| 142 |
+
|
| 143 |
+
Training data can be provided upon request.
|
| 144 |
+
|
| 145 |
+
# Generation
|
| 146 |
+
|
| 147 |
```
|
| 148 |
from transformers import M2M100ForConditionalGeneration
|
| 149 |
from tokenization_small100 import SMALL100Tokenizer
|
|
|
|
| 169 |
# => "Life is like a box of chocolate."
|
| 170 |
```
|
| 171 |
|
| 172 |
+
# Evaluation
|
| 173 |
+
|
| 174 |
+
Please refer to the [original repository](https://github.com/alirezamshi/small100) for spBLEU computation.
|
| 175 |
|
| 176 |
# Languages Covered
|
| 177 |
|
|
|
|
| 181 |
|
| 182 |
If you use this model for your research, please cite the following work:
|
| 183 |
```
|
| 184 |
+
@misc{mohammadshahi2022small100,
|
| 185 |
+
title={SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages},
|
| 186 |
+
author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
|
| 187 |
+
year={2022},
|
| 188 |
+
eprint={2210.11621},
|
| 189 |
+
archivePrefix={arXiv},
|
| 190 |
+
primaryClass={cs.CL}
|
| 191 |
+
}
|
| 192 |
+
|
| 193 |
+
@misc{mohammadshahi2022compressed,
|
| 194 |
+
title={What Do Compressed Multilingual Machine Translation Models Forget?},
|
| 195 |
+
author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
|
| 196 |
+
year={2022},
|
| 197 |
+
eprint={2205.10828},
|
| 198 |
+
archivePrefix={arXiv},
|
| 199 |
+
primaryClass={cs.CL}
|
| 200 |
}
|
| 201 |
```
|